Skip to content

Conversation

@GuoPhilipse
Copy link
Owner

sync

MaxGekk and others added 5 commits June 5, 2020 14:18
### What changes were proposed in this pull request?
Set `spark.sql.datetime.java8API.enabled` to `true` in:
1. `SparkSQLEnv.init()` of Thrift server, and
2. `SparkSQLSessionManager.openSession()`

### Why are the changes needed?
1. Date and timestamp string literals are parsed by using Java 8 time API and Spark's session time zone. Before the changes, date/timestamp values were collected as legacy types `java.sql.Date`/`java.sql.Timestamp`, and the value of such types didn't respect the config `spark.sql.session.timeZone`. To have consistent view, users had to keep JVM time zone and Spark's session time zone in sync.
2. After the changes, formatting of date values doesn't depend on JVM time zone.
3. While returning dates/timestamps of Java 8 type, we can avoid dates/timestamps rebasing from Proleptic Gregorian calendar to the hybrid calendar (Julian + Gregorian), and the issues related to calendar switching.
4. Properly handle negative years (BCE).
5. Consistent conversion of date/timestamp strings to/from internal Catalyst types in both direction to and from Spark.

### Does this PR introduce any user-facing change?
Yes. Before:
```sql
spark-sql> select make_date(-44, 3, 15);
0045-03-15
```
After:
```sql
spark-sql> select make_date(-44, 3, 15);
-0044-03-15
```

### How was this patch tested?
Manually via `bin/spark-sql`.

Closes #28729 from MaxGekk/enable-java8-time-api-in-thrift-server.

Lead-authored-by: Max Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…rmatting

### What changes were proposed in this pull request?

In this PR, we throw `SparkUpgradeException` when getting `DateTimeException` for datetime formatting in the `EXCEPTION` legacy Time Parser Policy.

### Why are the changes needed?
`DateTimeException` is also declared by `java.time.format.DateTimeFormatter#format`, but in Spark, it can barely occur. We have suspected one that due to a JDK bug so far. see https://bugs.openjdk.java.net/browse/JDK-8079628.

For `from_unixtime` function, we will suppress the DateTimeException caused by `DD` and result `NULL`. It is a silent date change that should be avoided in Java 8.

### Does this PR introduce _any_ user-facing change?

Yes,  when running on Java8 and using `from_unixtime` function with pattern `DD` to format datetimes, if dayofyear>=100, `SparkUpgradeException` will alert users instead of silently resulting null. For `date_format`, `SparkUpgradeException` take the palace of  `DateTimeException`.

### How was this patch tested?

add unit tests.

Closes #28736 from yaooqinn/SPARK-31867-F.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…tition columns

### What changes were proposed in this pull request?
```sql
CREATE TABLE t1(a STRING, B VARCHAR(10), C CHAR(10)) STORED AS parquet;
CREATE TABLE t2 USING parquet PARTITIONED BY (b, c) AS SELECT * FROM t1;
SELECT * FROM t2 WHERE b = 'A';
```
Above SQL throws MetaException

> Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:810)
	... 114 more
Caused by: MetaException(message:Filtering is supported only on partition keys of type string, or integral types)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$FilterBuilder.setError(ExpressionTree.java:184)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.getJdoFilterPushdownParam(ExpressionTree.java:439)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilterOverPartitions(ExpressionTree.java:356)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilter(ExpressionTree.java:278)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree.generateJDOFilterFragment(ExpressionTree.java:583)
	at org.apache.hadoop.hive.metastore.ObjectStore.makeQueryFilterString(ObjectStore.java:3315)
	at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsViaOrmFilter(ObjectStore.java:2768)
	at org.apache.hadoop.hive.metastore.ObjectStore.access$500(ObjectStore.java:182)
	at org.apache.hadoop.hive.metastore.ObjectStore$7.getJdoResult(ObjectStore.java:3248)
	at org.apache.hadoop.hive.metastore.ObjectStore$7.getJdoResult(ObjectStore.java:3232)
	at org.apache.hadoop.hive.metastore.ObjectStore$GetHelper.run(ObjectStore.java:2974)
	at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilterInternal(ObjectStore.java:3250)
	at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilter(ObjectStore.java:2906)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:101)
	at com.sun.proxy.$Proxy25.getPartitionsByFilter(Unknown Source)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_partitions_by_filter(HiveMetaStore.java:5093)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:148)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
	at com.sun.proxy.$Proxy26.get_partitions_by_filter(Unknown Source)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsByFilter(HiveMetaStoreClient.java:1232)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173)
	at com.sun.proxy.$Proxy27.listPartitionsByFilter(Unknown Source)
	at org.apache.hadoop.hive.ql.metadata.Hive.getPartitionsByFilter(Hive.java:2679)
	... 119 more

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add a unit test.

Closes #28724 from LantaoJin/SPARK-31904.

Authored-by: LantaoJin <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
### What changes were proposed in this pull request?
`fileNameOnly` parameter is split to 2 pieces in [this](dbb8143) commit. This PR re-unites it.

### Why are the changes needed?
Parameter description split in doc.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
```
cd docs/
SKIP_API=1 jekyll build
```
Manual webpage check.

Closes #28739 from gaborgsomogyi/datasettxtfix.

Authored-by: Gabor Somogyi <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request?
* Pagination Support is added to all tables of streaming page in spark web UI.
For adding pagination support, existing classes from #7399 were used.
* Earlier streaming page has two tables `Active Batches` and `Completed Batches`. Now, we will have three tables `Running Batches`, `Waiting Batches` and `Completed Batches`. If we have large number of waiting and running batches then keeping track in a single table is difficult. Also other pages have different table for different type type of data.
* Earlier empty tables were shown. Now only non-empty tables will be shown.
`Active Batches` table used to show details of waiting batches followed by running batches.

### Why are the changes needed?
Pagination will allow users to analyse the table in much better way. All spark web UI pages support pagination apart from streaming pages, so this will add consistency as well. Also it might fix the potential OOM errors that can arise.

### Does this PR introduce _any_ user-facing change?
Yes. `Active Batches` table is split into two tables `Running Batches` and `Waiting Batches`. Pagination Support is added to the all the tables. Every other functionality is unchanged.

### How was this patch tested?
Manually.

Before changes:
<img width="1667" alt="Screenshot 2020-05-03 at 7 07 14 PM" src="https://user-images.githubusercontent.com/15366835/80915680-8fb44b80-8d71-11ea-9957-c4a3769b8b67.png">

After Changes:
<img width="1669" alt="Screenshot 2020-05-03 at 6 51 22 PM" src="https://user-images.githubusercontent.com/15366835/80915694-a9ee2980-8d71-11ea-8fc5-246413a4951d.png">

Closes #28439 from iRakson/streamingPagination.

Authored-by: iRakson <[email protected]>
Signed-off-by: Kousuke Saruta <[email protected]>
@GuoPhilipse GuoPhilipse merged commit 664277e into GuoPhilipse:master Jun 7, 2020
GuoPhilipse pushed a commit that referenced this pull request Jun 10, 2020
### What changes were proposed in this pull request?

This PR proposes to make `PythonFunction` holds `Seq[Byte]` instead of `Array[Byte]` to be able to compare if the byte array has the same values for the cache manager.

### Why are the changes needed?

Currently the cache manager doesn't use the cache for `udf` if the `udf` is created again even if the functions is the same.

```py
>>> func = lambda x: x

>>> df = spark.range(1)
>>> df.select(udf(func)("id")).cache()
```
```py
>>> df.select(udf(func)("id")).explain()
== Physical Plan ==
*(2) Project [pythonUDF0#14 AS <lambda>(id)#12]
+- BatchEvalPython [<lambda>(id#0L)], [pythonUDF0#14]
 +- *(1) Range (0, 1, step=1, splits=12)
```

This is because `PythonFunction` holds `Array[Byte]`, and `equals` method of array equals only when the both array is the same instance.

### Does this PR introduce _any_ user-facing change?

Yes, if the user reuse the Python function for the UDF, the cache manager will detect the same function and use the cache for it.

### How was this patch tested?

I added a test case and manually.

```py
>>> df.select(udf(func)("id")).explain()
== Physical Plan ==
InMemoryTableScan [<lambda>(id)#12]
   +- InMemoryRelation [<lambda>(id)#12], StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *(2) Project [pythonUDF0#5 AS <lambda>(id)#3]
            +- BatchEvalPython [<lambda>(id#0L)], [pythonUDF0#5]
               +- *(1) Range (0, 1, step=1, splits=12)
```

Closes apache#28774 from ueshin/issues/SPARK-31945/udf_cache.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants