[SPARK-16355] [SPARK-16354] [SQL] Fix Bugs When LIMIT/TABLESAMPLE is Non-foldable, Zero or Negative #14034
Changes from 6 commits
In `CheckAnalysis.scala` (hunk `@@ -46,6 +46,20 @@ trait CheckAnalysis extends PredicateHelper`). The tail of this hunk is truncated in the diff view; it is reconstructed below from the error message asserted in the tests added in this PR:

```scala
    }).length > 1
  }

  private def checkLimitClause(limitExpr: Expression): Unit = {
    if (!limitExpr.foldable) {
      failAnalysis(
        "The argument to the LIMIT clause must evaluate to a constant value. " +
          s"Limit:${limitExpr.sql}")
    }
    val numRows = limitExpr.eval().asInstanceOf[Int]
    // Reconstructed tail: reject negative limits with the message the tests expect.
    if (numRows < 0) {
      failAnalysis(
        "number_rows in limit clause must be equal to or greater than 0. " +
          s"number_rows:$numRows")
    }
  }
```
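With this check in place, both bug categories from the JIRAs fail fast at analysis time. For example (queries taken from the tests added below):

```scala
// Non-foldable argument: rejected with
//   "The argument to the LIMIT clause must evaluate to a constant value.
//    Limit:(testdata.`key` > 3)"
sql("SELECT * FROM testData LIMIT key > 3")

// Negative constant: rejected with
//   "number_rows in limit clause must be equal to or greater than 0. number_rows:-1"
sql("SELECT * FROM testData LIMIT -1")
```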
| assertEqual(s"$sql limit cast(9 / 4 as int)", plan.limit(Cast(Literal(9) / 4, IntegerType))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, but it's still foldable. Is it possible it's non-foldable?
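For reference, `foldable` marks expressions Catalyst can reduce to a constant before execution. A rough sketch of the contrast, using the Catalyst expression DSL:

```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.IntegerType

// cast(9 / 4 as int) is built purely from literals, so it can be
// constant-folded and passes the new analysis check:
Cast(Literal(9) / 4, IntegerType).foldable  // true

// An expression over an attribute depends on the input row, so it is
// not foldable and the check rejects it:
('key > 3).foldable  // false
```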
- Oracle: http://docs.oracle.com/javadb/10.5.3.0/ref/rrefsqljoffsetfetch.html
- DB2 z/OS: https://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/com.ibm.db2z10.doc.sqlref/src/tpc/db2z_sql_fetchfirstclause.html
- MySQL: http://dev.mysql.com/doc/refman/5.7/en/select.html
- PostgreSQL: https://www.postgresql.org/docs/8.1/static/queries-limit.html

It sounds like nobody supports it. All of the mainstream DB vendors only support integer literals.
We do not support non-foldable limit clauses.
`spark/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala`, lines 67 to 89 in d063898:
```scala
object SpecialLimits extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case logical.ReturnAnswer(rootPlan) => rootPlan match {
      case logical.Limit(IntegerLiteral(limit), logical.Sort(order, true, child)) =>
        execution.TakeOrderedAndProjectExec(limit, order, None, planLater(child)) :: Nil
      case logical.Limit(
          IntegerLiteral(limit),
          logical.Project(projectList, logical.Sort(order, true, child))) =>
        execution.TakeOrderedAndProjectExec(
          limit, order, Some(projectList), planLater(child)) :: Nil
      case logical.Limit(IntegerLiteral(limit), child) =>
        execution.CollectLimitExec(limit, planLater(child)) :: Nil
      case other => planLater(other) :: Nil
    }
    case logical.Limit(IntegerLiteral(limit), logical.Sort(order, true, child)) =>
      execution.TakeOrderedAndProjectExec(limit, order, None, planLater(child)) :: Nil
    case logical.Limit(
        IntegerLiteral(limit), logical.Project(projectList, logical.Sort(order, true, child))) =>
      execution.TakeOrderedAndProjectExec(
        limit, order, Some(projectList), planLater(child)) :: Nil
    case _ => Nil
  }
}
```
`spark/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala`, lines 398 to 401 in d063898:
```scala
case logical.LocalLimit(IntegerLiteral(limit), child) =>
  execution.LocalLimitExec(limit, planLater(child)) :: Nil
case logical.GlobalLimit(IntegerLiteral(limit), child) =>
  execution.GlobalLimitExec(limit, planLater(child)) :: Nil
```
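For context, the `IntegerLiteral` extractor these strategies rely on matches only a constant integer literal, so a non-foldable limit such as `key > 3` falls through every case to `case _ => Nil`. A sketch of the extractor as I recall it in Catalyst:

```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.IntegerType

// Matches Limit(Literal(10, IntegerType), ...) but not Limit('key > 3, ...),
// which is why planning a non-literal limit finds no strategy and trips the
// "No plan for ..." assertion shown below.
object IntegerLiteral {
  def unapply(a: Any): Option[Int] = a match {
    case Literal(value: Int, IntegerType) => Some(value)
    case _ => None
  }
}
```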
But we do not issue an exception when users do this, so the error they get is confusing:

```
java.lang.AssertionError: assertion failed: No plan for GlobalLimit (_nondeterministic#203 > 0.2)
+- Project [key#11, value#12, rand(-1441968339187861415) AS _nondeterministic#203]
   +- LocalLimit (_nondeterministic#202 > 0.2)
      +- Project [key#11, value#12, rand(-1308350387169017676) AS _nondeterministic#202]
         +- LogicalRDD [key#11, value#12]
```
Let me do it in this PR. Thank you for your review! : )
In `SQLQuerySuite.scala` (hunk `@@ -660,18 +660,39 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext`):
| test("limit") { | ||
| checkAnswer( | ||
| sql("SELECT * FROM testData LIMIT 10"), | ||
| sql("SELECT * FROM testData LIMIT 9 + 1"), | ||
| testData.take(10).toSeq) | ||
|
|
||
| checkAnswer( | ||
| sql("SELECT * FROM arrayData LIMIT 1"), | ||
| sql("SELECT * FROM arrayData LIMIT CAST(1 AS Integer)"), | ||
| arrayData.collect().take(1).map(Row.fromTuple).toSeq) | ||
|
|
||
| checkAnswer( | ||
| sql("SELECT * FROM mapData LIMIT 1"), | ||
| mapData.collect().take(1).map(Row.fromTuple).toSeq) | ||
| } | ||
|
|
||
| test("non-foldable expressions in LIMIT") { | ||
| val e = intercept[AnalysisException] { | ||
| sql("SELECT * FROM testData LIMIT key > 3") | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. what will happen if the type is wrong? e.g.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A good question! : ) Now, the exception we issued is not good: Let me fix it and throw a more reasonable exception: |
||
| }.getMessage | ||
| assert(e.contains("The argument to the LIMIT clause must evaluate to a constant value. " + | ||
| "Limit:(testdata.`key` > 3)")) | ||
| } | ||
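The concrete queries in that exchange are elided above; purely as a hypothetical illustration, a wrong-type but foldable limit would be something like:

```scala
// Boolean-typed limit: foldable, so the constant-value check passes, but the
// eval().asInstanceOf[Int] in checkLimitClause would fail at analysis time.
// (Hypothetical query; the improved message is not shown in this thread.)
sql("SELECT * FROM testData LIMIT true")
```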
```scala
  test("negative in LIMIT or TABLESAMPLE") {
    val expected = "number_rows in limit clause must be equal to or greater than 0. number_rows:-1"
    var e = intercept[AnalysisException] {
      sql("SELECT * FROM testData TABLESAMPLE (-1 rows)")
    }.getMessage
    assert(e.contains(expected))

    e = intercept[AnalysisException] {
      sql("SELECT * FROM testData LIMIT -1")
    }.getMessage
    assert(e.contains(expected))
  }

  test("CTE feature") {
    checkAnswer(
      sql("with q1 as (select * from testData limit 10) select * from q1"),
      // ... (context truncated in this diff view)
```
In `StatisticsSuite.scala` (hunks `@@ -17,10 +17,12 @@` and `@@ -31,4 +33,46 @@`):

```scala
package org.apache.spark.sql

import org.apache.spark.sql.catalyst.plans.logical.{GlobalLimit, Join, LocalLimit}
import org.apache.spark.sql.test.SharedSQLContext
import org.apache.spark.sql.types._

class StatisticsSuite extends QueryTest with SharedSQLContext {
  import testImplicits._

  test("SPARK-15392: DataFrame created from RDD should not be broadcasted") {
    val rdd = sparkContext.range(1, 100).map(i => Row(i, i))
    // ... (context truncated in this diff view)
      spark.sessionState.conf.autoBroadcastJoinThreshold)
  }

  test("estimates the size of limit") {
    withTempTable("test") {
      Seq(("one", 1), ("two", 2), ("three", 3), ("four", 4)).toDF("k", "v")
        .createOrReplaceTempView("test")
      Seq((0, 1), (1, 24), (2, 48)).foreach { case (limit, expected) =>
        val df = sql(s"""SELECT * FROM test limit $limit""")

        val sizesGlobalLimit = df.queryExecution.analyzed.collect { case g: GlobalLimit =>
          g.statistics.sizeInBytes
        }
        assert(sizesGlobalLimit.size === 1, s"Size wrong for:\n ${df.queryExecution}")
        assert(sizesGlobalLimit.head === BigInt(expected),
          s"expected exact size $expected for table 'test', got: ${sizesGlobalLimit.head}")

        val sizesLocalLimit = df.queryExecution.analyzed.collect { case l: LocalLimit =>
          l.statistics.sizeInBytes
        }
        assert(sizesLocalLimit.size === 1, s"Size wrong for:\n ${df.queryExecution}")
        assert(sizesLocalLimit.head === BigInt(expected),
          s"expected exact size $expected for table 'test', got: ${sizesLocalLimit.head}")
      }
    }
  }
```
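The expected sizes pin down how the limit operators should report statistics: the default size of a string column is 20 bytes and of an int is 4, so one `("one", 1)` row is 24 bytes, and a limit of 0 must be floored at 1 byte so that product-based estimates downstream do not collapse to zero. A sketch of the corresponding override, reconstructed from these expected values (the logical-operator diff itself is not shown in this view):

```scala
// Sketch only, not the verbatim PR diff: limit-based operators report
// limit * per-row size, floored at 1 byte for a zero limit.
case class GlobalLimit(limitExpr: Expression, child: LogicalPlan) extends UnaryNode {
  override def output: Seq[Attribute] = child.output

  override lazy val statistics: Statistics = {
    val limit = limitExpr.eval().asInstanceOf[Int]
    val sizeInBytes = if (limit == 0) {
      // sizeInBytes can't be zero, or a BinaryNode's product-based
      // estimate would collapse to zero as well.
      1
    } else {
      (limit: Long) * output.map(a => a.dataType.defaultSize).sum
    }
    child.statistics.copy(sizeInBytes = sizeInBytes)
  }
}
```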
The same hunk also adds:

```scala
  test("estimates the size of a limit 0 on outer join") {
    withTempTable("test") {
      Seq(("one", 1), ("two", 2), ("three", 3), ("four", 4)).toDF("k", "v")
        .createOrReplaceTempView("test")
      val df1 = spark.table("test")
      val df2 = spark.table("test").limit(0)
      val df = df1.join(df2, Seq("k"), "left")

      val sizes = df.queryExecution.analyzed.collect { case g: Join =>
        g.statistics.sizeInBytes
      }

      assert(sizes.size === 1, s"Size wrong for:\n ${df.queryExecution}")
      assert(sizes.head === BigInt(96),
        s"expected exact size 96 for table 'test', got: ${sizes.head}")
    }
  }
}
```
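The expected 96 bytes is consistent with that floor: the default `Join` estimate multiplies its children's sizes, so the four-row side (4 × 24 = 96 bytes) times the 1-byte floor from `limit(0)` yields 96 rather than 0, which would otherwise have wrongly qualified the join for broadcasting. A quick way to observe the floored estimate (Spark 2.0-era API):

```scala
// The analyzed plan of a LIMIT 0 query reports 1 byte, not 0.
val df = spark.table("test").limit(0)
println(df.queryExecution.analyzed.statistics.sizeInBytes)  // 1
```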
If it must be foldable, can we just use `int` as the limit instead of an expression?
Are you saying we should change the SQL parser? If so, we also need to change `TABLESAMPLE n ROWS`. FYI, even Hive does not support foldable expressions.
I found that Impala supports it.
http://www.cloudera.com/documentation/archive/impala/2-x/2-1-x/topics/impala_limit.html
I'm saying maybe we can declare Limit as `case class GlobalLimit(limit: Int, child: LogicalPlan)`, but it's not a small change; we could think about it later.
uh, I see. Thank you!
this may be more readable:
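The suggested rewrite itself is elided in this view; one plausible match-based shape, consistent with the checks above (purely illustrative):

```scala
// Hypothetical refactoring of checkLimitClause into a single match;
// messages kept as in the version shown earlier.
private def checkLimitClause(limitExpr: Expression): Unit = {
  limitExpr match {
    case e if !e.foldable =>
      failAnalysis(
        s"The argument to the LIMIT clause must evaluate to a constant value. Limit:${e.sql}")
    case e =>
      val numRows = e.eval().asInstanceOf[Int]
      if (numRows < 0) {
        failAnalysis(
          s"number_rows in limit clause must be equal to or greater than 0. number_rows:$numRows")
      }
  }
}
```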