Maven generate thrift source code #2

wangyum · 2019-10-24T16:34:21Z

This PR makes maven generate thrift source code.

How to test:

Install thrift

# Mac OS
brew install [email protected]
export PATH="/usr/local/opt/[email protected]/bin:$PATH"

# CentOS 7
wget http://apache.fayea.com/thrift/0.9.3/thrift-0.9.3.tar.gz
tar -zxf thrift-0.9.3.tar.gz
cd thrift-0.9.3
./configure --with-lua=no
make && make install

Build thriftserver package

mvn package -DskipTests -pl sql/thriftserver -Pspark-thriftserver

juliuszsompolski · 2019-10-24T17:36:22Z

The only catch is that it points to the specific executable in <thriftExecutable>/usr/local/opt/[email protected]/bin/thrift</thriftExecutable>. It will be different on different platforms, e.g. I have thrift in /usr/bin/thrift on Ubuntu.
This https://stackoverflow.com/questions/52044010/how-to-use-thrift-generator-as-maven-dependencyhow-to-avoid-reference-to-exe-f suggests that it could be set up on the build machine via a ~/.m2/settings.xml property on the build machine, but that it's not something that can be picked up automatically...
I don't think Spark build has any other step where you have to set up something manually like that? Unless thrift compiler can be installed automatically, I think the generated code will have to be committed, or this change will break too many build envs...

juliuszsompolski · 2019-10-24T17:46:25Z

There's https://github.com/ccascone/mvn-thrift-compiler, but it's not maintained...
We could do what it did directly - commit Linux and OSX thrift binaries, but I think that will also not necessarily work on all platforms, as it still might have some libc or libstdc++ or it seems it also links libgcc_s dependencies that may slightly change depending on linux distribution:

$ ldd `which thrift`
        linux-vdso.so.1 =>  (0x00007ffed77f5000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007ff97b0f1000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007ff97aedb000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ff97ab11000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007ff97a808000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ff97b473000)

so the binaries might not work on all Linux distros...

…ft-maven-plugin

wangyum · 2019-10-25T02:37:38Z

Yes. We should not set thriftExecutable to a fixed value, users should add thrift to their PATH.

### What changes were proposed in this pull request? `org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite` failed lately. After had a look at the logs it just shows the following fact without any details: ``` Caused by: sbt.ForkMain$ForkError: sun.security.krb5.KrbException: Server not found in Kerberos database (7) - Server not found in Kerberos database ``` Since the issue is intermittent and not able to reproduce it we should add more debug information and wait for reproduction with the extended logs. ### Why are the changes needed? Failing test doesn't give enough debug information. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? I've started the test manually and checked that such additional debug messages show up: ``` >>> KrbApReq: APOptions are 00000000 00000000 00000000 00000000 >>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType Looking for keys for: kafka/localhostEXAMPLE.COM Added key: 17version: 0 Added key: 23version: 0 Added key: 16version: 0 Found unsupported keytype (3) for kafka/localhostEXAMPLE.COM >>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType Using builtin default etypes for permitted_enctypes default etypes for permitted_enctypes: 17 16 23. >>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType MemoryCache: add 1571936500/174770/16C565221B70AAB2BEFE31A83D13A2F4/client/localhostEXAMPLE.COM to client/localhostEXAMPLE.COM|kafka/localhostEXAMPLE.COM MemoryCache: Existing AuthList: #3: 1571936493/200803/8CD70D280B0862C5DA1FF901ECAD39FE/client/localhostEXAMPLE.COM #2: 1571936499/985009/BAD33290D079DD4E3579A8686EC326B7/client/localhostEXAMPLE.COM #1: 1571936499/995208/B76B9D78A9BE283AC78340157107FD40/client/localhostEXAMPLE.COM ``` Closes apache#26252 from gaborgsomogyi/SPARK-29580. Authored-by: Gabor Somogyi <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>

### What changes were proposed in this pull request? fix the error caused by interval output in ExtractBenchmark ### Why are the changes needed? fix a bug in the test ```scala [info] Running case: cast to interval [error] Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot use interval type in the table schema.;; [error] OverwriteByExpression RelationV2[] noop-table, true, true [error] +- Project [(subtractdates(cast(cast(id#0L as timestamp) as date), -719162) + subtracttimestamps(cast(id#0L as timestamp), -30610249419876544)) AS ((CAST(CAST(id AS TIMESTAMP) AS DATE) - DATE '0001-01-01') + (CAST(id AS TIMESTAMP) - TIMESTAMP '1000-01-01 01:02:03.123456'))#2] [error] +- Range (1262304000, 1272304000, step=1, splits=Some(1)) [error] [error] at org.apache.spark.sql.catalyst.util.TypeUtils$.failWithIntervalType(TypeUtils.scala:106) [error] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$25(CheckAnalysis.scala:389) [error] at org.a ``` ### Does this PR introduce any user-facing change? no ### How was this patch tested? re-run benchmark Closes apache#27867 from yaooqinn/SPARK-31111. Authored-by: Kent Yao <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>

### What changes were proposed in this pull request? fix the error caused by interval output in ExtractBenchmark ### Why are the changes needed? fix a bug in the test ```scala [info] Running case: cast to interval [error] Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot use interval type in the table schema.;; [error] OverwriteByExpression RelationV2[] noop-table, true, true [error] +- Project [(subtractdates(cast(cast(id#0L as timestamp) as date), -719162) + subtracttimestamps(cast(id#0L as timestamp), -30610249419876544)) AS ((CAST(CAST(id AS TIMESTAMP) AS DATE) - DATE '0001-01-01') + (CAST(id AS TIMESTAMP) - TIMESTAMP '1000-01-01 01:02:03.123456'))#2] [error] +- Range (1262304000, 1272304000, step=1, splits=Some(1)) [error] [error] at org.apache.spark.sql.catalyst.util.TypeUtils$.failWithIntervalType(TypeUtils.scala:106) [error] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$25(CheckAnalysis.scala:389) [error] at org.a ``` ### Does this PR introduce any user-facing change? no ### How was this patch tested? re-run benchmark Closes apache#27867 from yaooqinn/SPARK-31111. Authored-by: Kent Yao <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 2b46662) Signed-off-by: Wenchen Fan <[email protected]>

…chmarks ### What changes were proposed in this pull request? Replace `CAST(... AS TIMESTAMP` by `TIMESTAMP_SECONDS` in the following benchmarks: - ExtractBenchmark - DateTimeBenchmark - FilterPushdownBenchmark - InExpressionBenchmark ### Why are the changes needed? The benchmarks fail w/o the changes: ``` [info] Running benchmark: datetime +/- interval [info] Running case: date + interval(m) [error] Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`id` AS TIMESTAMP)' due to data type mismatch: cannot cast bigint to timestamp,you can enable the casting by setting spark.sql.legacy.allowCastNumericToTimestamp to true,but we strongly recommend using function TIMESTAMP_SECONDS/TIMESTAMP_MILLIS/TIMESTAMP_MICROS instead.; line 1 pos 5; [error] 'Project [(cast(cast(id#0L as timestamp) as date) + 1 months) AS (CAST(CAST(id AS TIMESTAMP) AS DATE) + INTERVAL '1 months')#2] [error] +- Range (0, 10000000, step=1, splits=Some(1)) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the affected benchmarks. Closes apache#28843 from MaxGekk/GuoPhilipse-31710-fix-compatibility-followup. Authored-by: Max Gekk <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>

…e are foldable boolean types ### What changes were proposed in this pull request? Improve `SimplifyConditionals`. Simplify `If(cond, TrueLiteral, FalseLiteral)` to `cond`. Simplify `If(cond, FalseLiteral, TrueLiteral)` to `Not(cond)`. The use case is: ```sql create table t1 using parquet as select id from range(10); select if (id > 2, false, true) from t1; ``` Before this pr: ``` == Physical Plan == *(1) Project [if ((id#1L > 2)) false else true AS (IF((id > CAST(2 AS BIGINT)), false, true))#2] +- *(1) ColumnarToRow +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint> ``` After this pr: ``` == Physical Plan == *(1) Project [(id#1L <= 2) AS (IF((id > CAST(2 AS BIGINT)), false, true))#2] +- *(1) ColumnarToRow +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint> ``` ### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes apache#30849 from wangyum/SPARK-33798-2. Authored-by: Yuming Wang <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>

### What changes were proposed in this pull request? This PR intends to fix flaky GitHub Actions (GA) tests below in `transform.sql` (this flakiness does not seem to happen in the Jenkins tests): - https://github.com/apache/spark/runs/1592987501 - https://github.com/apache/spark/runs/1593196242 - https://github.com/apache/spark/runs/1595496305 - https://github.com/apache/spark/runs/1596309555 This is because the error message is different between test runs in GA (the error message seems to be truncated indeterministically) ,e.g., ``` # https://github.com/apache/spark/runs/1592987501 Expected "...h status 127. Error:[ /bin/bash: some_non_existent_command: command not found]", but got "...h status 127. Error:[]" Result did not match for query #2 # https://github.com/apache/spark/runs/1593196242 Expected "...istent_command: comm[and not found]", but got "...istent_command: comm[]" Result did not match for query #2 ``` The root cause of this indeterministic behaviour happening only in GA is not clear though, this test throws SparkException consistently even in GA. So, this PR proposes to make the test just check if it will be thrown when running it. This PR comes from the dongjoon-hyun comment: https://github.com/apache/spark/pull/29414/files#r547414513 ### Why are the changes needed? Bugfix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests. Closes apache#30896 from maropu/SPARK-32106-FOLLOWUP. Authored-by: Takeshi Yamamuro <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>

…pendently in Scala 2.13 ### What changes were proposed in this pull request? Similar to SPARK-35532, the main change of this pr is add `scala-2.13` profile to external/kafka-0-10-sql/pom.xml, external/avro/pom.xml and sql/hive-thriftserver/pom.xml, the `scala-2.13` profile include dependency on `scala-parallel-collections_2.13`, then all(34) spark modules can maven test independently. ### Why are the changes needed? Ensure alll(34) spark modules can be maven test independently in Scala 2.13 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass the GitHub Action Scala 2.13 job - Manual test： 1. Execute ``` dev/change-scala-version.sh 2.13 mvn clean install -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 ``` 2. maven test `external/kafka-0-10-sql` module ``` mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl external/kafka-0-10-sql ``` **before** ``` Discovery starting. Discovery completed in 857 milliseconds. Run starting. Expected test count is: 464 ... KafkaRelationSuiteV2: - explicit earliest to latest offsets - default starting and ending offsets - explicit offsets - default starting and ending offsets with headers - timestamp provided for starting and ending - timestamp provided for starting, offset provided for ending - timestamp provided for ending, offset provided for starting - timestamp provided for starting, ending not provided - timestamp provided for ending, starting not provided - global timestamp provided for starting and ending - no matched offset for timestamp - startingOffsets - preferences on offset related options - no matched offset for timestamp - endingOffsets *** RUN ABORTED *** java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.SparkContext.withScope(SparkContext.scala:788) at org.apache.spark.SparkContext.union(SparkContext.scala:1405) at org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:182) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:220) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:217) ... Cause: java.lang.ClassNotFoundException: scala.collection.parallel.TaskSupport at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:418) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:351) at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.SparkContext.withScope(SparkContext.scala:788) at org.apache.spark.SparkContext.union(SparkContext.scala:1405) at org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697) ... ``` **After** ``` Run completed in 33 minutes, 51 seconds. Total number of tests run: 464 Suites: completed 31, aborted 0 Tests: succeeded 464, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` 3. maven test `external/avro` module ``` mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl external/avro ``` **before** ``` Discovery starting. Discovery completed in 2 seconds, 765 milliseconds. Run starting. Expected test count is: 255 AvroReadSchemaSuite: - append column at the end - hide column at the end - append column into middle - hide column in the middle - add a nested column at the end of the leaf struct column - add a nested column in the middle of the leaf struct column - add a nested column at the end of the middle struct column - add a nested column in the middle of the middle struct column - hide a nested column at the end of the leaf struct column - hide a nested column in the middle of the leaf struct column - hide a nested column at the end of the middle struct column - hide a nested column in the middle of the middle struct column *** RUN ABORTED *** java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.SparkContext.withScope(SparkContext.scala:788) at org.apache.spark.SparkContext.union(SparkContext.scala:1405) at org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:182) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:220) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:217) ... Cause: java.lang.ClassNotFoundException: scala.collection.parallel.TaskSupport at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:418) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:351) at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.SparkContext.withScope(SparkContext.scala:788) at org.apache.spark.SparkContext.union(SparkContext.scala:1405) at org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697) ... ``` **After** ``` Run completed in 1 minute, 42 seconds. Total number of tests run: 255 Suites: completed 12, aborted 0 Tests: succeeded 255, failed 0, canceled 0, ignored 2, pending 0 All tests passed. ``` 4. maven test `sql/hive-thriftserver` module ``` mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl sql/hive-thriftserver ``` **before** ``` - union.sql *** FAILED *** "1 a 1 a 2 b 2 b" did not contain "Exception" Exception did not match for query #2 SELECT * FROM (SELECT * FROM t1 UNION ALL SELECT * FROM t1), expected: 1 a 1 a 2 b 2 b, but got: java.sql.SQLException org.apache.hive.service.cli.HiveSQLException: Error running query: java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:38) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:324) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:229) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79) at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:63) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:43) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:229) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:224) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:238) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.SparkContext.withScope(SparkContext.scala:788) at org.apache.spark.SparkContext.union(SparkContext.scala:1405) at org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:182) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:220) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:217) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:178) at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:323) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:389) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3719) at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2987) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3710) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:774) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3708) at org.apache.spark.sql.Dataset.collect(Dataset.scala:2987) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:299) ... 16 more Caused by: java.lang.ClassNotFoundException: scala.collection.parallel.TaskSupport at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:418) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ... 40 more (ThriftServerQueryTestSuite.scala:209) ``` **After** ``` Run completed in 29 minutes, 17 seconds. Total number of tests run: 535 Suites: completed 20, aborted 0 Tests: succeeded 535, failed 0, canceled 0, ignored 17, pending 0 All tests passed. ``` Closes apache#32994 from LuciferYang/SPARK-35838. Authored-by: YangJie <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>

…iEnabled in 'cast string to date #2' ### What changes were proposed in this pull request? This PR fixes the test to make `CastWithAnsiOffSuite` properly respect `ansiEnabled` in `cast string to date #2` test by using `CastWithAnsiOffSuite.cast` instead of `Cast` expression. ### Why are the changes needed? To make the tests pass. Currently it fails when ANSI mode is on: https://github.com/apache/spark/runs/6786744647 ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually tested in my IDE. Closes apache#36802 from HyukjinKwon/SPARK-39321-followup. Authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>

…ly equivalent children in `RewriteDistinctAggregates` ### What changes were proposed in this pull request? In `RewriteDistinctAggregates`, when grouping aggregate expressions by function children, treat children that are semantically equivalent as the same. ### Why are the changes needed? This PR will reduce the number of projections in the Expand operator when there are multiple distinct aggregations with superficially different children. In some cases, it will eliminate the need for an Expand operator. Example: In the following query, the Expand operator creates 3\*n rows (where n is the number of incoming rows) because it has a projection for each of function children `b + 1`, `1 + b` and `c`. ``` create or replace temp view v1 as select * from values (1, 2, 3.0), (1, 3, 4.0), (2, 4, 2.5), (2, 3, 1.0) v1(a, b, c); select a, count(distinct b + 1), avg(distinct 1 + b) filter (where c > 0), sum(c) from v1 group by a; ``` The Expand operator has three projections (each producing a row for each incoming row): ``` [a#87, null, null, 0, null, UnscaledValue(c#89)], <== projection #1 (for regular aggregation) [a#87, (b#88 + 1), null, 1, null, null], <== projection #2 (for distinct aggregation of b + 1) [a#87, null, (1 + b#88), 2, (c#89 > 0.0), null]], <== projection #3 (for distinct aggregation of 1 + b) ``` In reality, the Expand only needs one projection for `1 + b` and `b + 1`, because they are semantically equivalent. With the proposed change, the Expand operator's projections look like this: ``` [a#67, null, 0, null, UnscaledValue(c#69)], <== projection #1 (for regular aggregations) [a#67, (b#68 + 1), 1, (c#69 > 0.0), null]], <== projection #2 (for distinct aggregation on b + 1 and 1 + b) ``` With one less projection, Expand produces 2\*n rows instead of 3\*n rows, but still produces the correct result. In the case where all distinct aggregates have semantically equivalent children, the Expand operator is not needed at all. Benchmark code in the JIRA (SPARK-40382). Before the PR: ``` distinct aggregates: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ all semantically equivalent 14721 14859 195 5.7 175.5 1.0X some semantically equivalent 14569 14572 5 5.8 173.7 1.0X none semantically equivalent 14408 14488 113 5.8 171.8 1.0X ``` After the PR: ``` distinct aggregates: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ all semantically equivalent 3658 3692 49 22.9 43.6 1.0X some semantically equivalent 9124 9214 127 9.2 108.8 0.4X none semantically equivalent 14601 14777 250 5.7 174.1 0.3X ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit tests. Closes apache#37825 from bersprockets/rewritedistinct_issue. Authored-by: Bruce Robbins <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>

### What changes were proposed in this pull request? This PR introduces sasl retry count in RetryingBlockTransferor. ### Why are the changes needed? Previously a boolean variable, saslTimeoutSeen, was used. However, the boolean variable wouldn't cover the following scenario: 1. SaslTimeoutException 2. IOException 3. SaslTimeoutException 4. IOException Even though IOException at #2 is retried (resulting in increment of retryCount), the retryCount would be cleared at step #4. Since the intention of saslTimeoutSeen is to undo the increment due to retrying SaslTimeoutException, we should keep a counter for SaslTimeoutException retries and subtract the value of this counter from retryCount. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New test is added, courtesy of Mridul. Closes apache#39611 from tedyu/sasl-cnt. Authored-by: Ted Yu <[email protected]> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

… throw internal error ### What changes were proposed in this pull request? This PR fixes the error messages and classes when Python UDFs are used in higher order functions. ### Why are the changes needed? To show the proper user-facing exceptions with error classes. ### Does this PR introduce _any_ user-facing change? Yes, previously it threw internal error such as: ```python from pyspark.sql.functions import transform, udf, col, array spark.range(1).select(transform(array("id"), lambda x: udf(lambda y: y)(x))).collect() ``` Before: ``` py4j.protocol.Py4JJavaError: An error occurred while calling o74.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 15 in stage 0.0 failed 1 times, most recent failure: Lost task 15.0 in stage 0.0 (TID 15) (ip-192-168-123-103.ap-northeast-2.compute.internal executor driver): org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot evaluate expression: <lambda>(lambda x_0#3L)#2 SQLSTATE: XX000 at org.apache.spark.SparkException$.internalError(SparkException.scala:92) at org.apache.spark.SparkException$.internalError(SparkException.scala:96) ``` After: ``` pyspark.errors.exceptions.captured.AnalysisException: [INVALID_LAMBDA_FUNCTION_CALL.UNEVALUABLE] Invalid lambda function call. Python UDFs should be used in a lambda function at a higher order function. However, "<lambda>(lambda x_0#3L)" was a Python UDF. SQLSTATE: 42K0D; Project [transform(array(id#0L), lambdafunction(<lambda>(lambda x_0#3L)#2, lambda x_0#3L, false)) AS transform(array(id), lambdafunction(<lambda>(lambda x_0#3L), namedlambdavariable()))#4] +- Range (0, 1, step=1, splits=Some(16)) ``` ### How was this patch tested? Unittest was added ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#47079 from HyukjinKwon/SPARK-48706. Authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Kent Yao <[email protected]>

Maven generate thrift source code

fc2648f

wangyum mentioned this pull request Oct 24, 2019

[WIP][SPARK-29108][SQL] Add new module sql/thriftserver and add v11 thrift protocol apache/spark#26221

Closed

org.apache.thrift.tools:maven-thrift-plugin -> org.apache.thrift:thri…

5365dcf

…ft-maven-plugin

AngersZhuuuu merged commit 4dc5c7e into AngersZhuuuu:SPARK-29018-V11 Oct 27, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Maven generate thrift source code #2

Maven generate thrift source code #2

Uh oh!

wangyum commented Oct 24, 2019 •

edited

Loading

Uh oh!

juliuszsompolski commented Oct 24, 2019

Uh oh!

juliuszsompolski commented Oct 24, 2019

Uh oh!

wangyum commented Oct 25, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Maven generate thrift source code #2

Maven generate thrift source code #2

Uh oh!

Conversation

wangyum commented Oct 24, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

juliuszsompolski commented Oct 24, 2019

Uh oh!

juliuszsompolski commented Oct 24, 2019

Uh oh!

wangyum commented Oct 25, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

wangyum commented Oct 24, 2019 •

edited

Loading