Skip to content
Prev Previous commit
Next Next commit
retuen null for empty input
  • Loading branch information
zhengruifeng committed Feb 14, 2017
commit 9a8fc1e5d141e00c8775855e9f9cd3f07f7905d6
Original file line number Diff line number Diff line change
Expand Up @@ -71,8 +71,12 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
col: String,
probabilities: Array[Double],
relativeError: Double): Array[Double] = {
StatFunctions.multipleApproxQuantiles(df.select(col).na.drop(),
Seq(col), probabilities, relativeError).head.toArray
val res = approxQuantile(Array(col), probabilities, relativeError)
if (res != null) {
res.head
} else {
null
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Scaladoc should describe this case

}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above five lines can be shorten to Option(res).map(_.head).orNull

}

/**
Expand All @@ -96,8 +100,12 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
cols: Array[String],
probabilities: Array[Double],
relativeError: Double): Array[Array[Double]] = {
StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*).na.drop(), cols,
probabilities, relativeError).map(_.toArray).toArray
try {
StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*).na.drop(), cols,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will drop the whole row if any column has null or NaN. For example,

    Seq[(java.lang.Long, java.lang.Double)]((null, 1.23), (3L, null), (4L, 3.45))
      .toDF("a", "b").na.drop().show()

That means, users could get different results. It depends on which API they used.

df.stat.approxQuantile("col1", Array(q1), epsilon)
df.stat.approxQuantile("col2", Array(q1), epsilon)
df.stat.approxQuantile(Array("col1", "col2"), Array(q1), epsilon)

I am wondering if this is the expected behavior?

Copy link
Member

@gatorsmile gatorsmile Feb 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not sound right to me. cc @holdenk @MLnick @jkbradley

I think that might be the reason why we did not provide such an API at the beginning.

If we want to make them consistent, it does not get any performance benefit from this API. Should we revert the PR #12135 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatorsmile Good catch! Agree that this will cause confusing results.
I think there are two way to make them consistent:
1, The behavior of na-droping was included in SPARK-17219 to enhanced NaN value handling, and the single-column version of approxQuantile is only used in QuantileDiscretizer. So we can make the na-droping happen in QuantileDiscretizer, and remove the na-drop in approxQuantile.
2, modify the impl StatFunctions.multipleApproxQuantiles to deal with null and NaN, and remove the na-drop in approxQuantile.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhengruifeng Sure. If we want to make them consistent, I am fine. How about reverting #12135 at first? At the same time, we can work on the new solution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally there was never any na dropping in approxQuantile as far as I can recall. That was added in #14858. cc @srowen

You could also simply change the na dropping to only drop from the cols passed as args for each version?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch! I vote for modifying multipleApproxQuantiles to handle null and NaN values. As far as reverting, I'm OK either way as long as we get the fix into 2.2. I'd actually recommend going ahead and merging this PR and creating a follow-up Critical Bug targeted at 2.2.

@MLnick I think dropping NAs from the cols passed as args still will not work. Say the user passes cols "a" and "b" as args, but some rows have (a = NaN, b = 1.0). Then those rows will be ignored.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let us add a TODO comment above this function and create a JIRA for tracking this issue. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're correct, I missed that!

Ok, so #14858 added the na dropping to approxQuantile. It wasn't there in the original impl. It can work on data that has NaN in the sense that it will return NaNs as part of the quantiles, in some cases. I'm not sure if that was the intended behavior in the original impl or not cc @thunterdb.

In that sense it's the same as aggs like min or max - those don't exclude NaN automatically, it's up to the user to handle them. So it could be best to just remove the na dropping from approxQuantile (which would revert to original impl behavior) and put it in QuantileDiscretizer which was the original need.

probabilities, relativeError).map(_.toArray).toArray
} catch {
case e: NoSuchElementException => null
}
}


Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -173,10 +173,13 @@ class DataFrameStatSuite extends QueryTest with SharedSQLContext {
assert(e2.getMessage.contains("Relative Error must be non-negative"))

// dataset should be non-empty
intercept[NoSuchElementException] {
df.selectExpr("*").limit(0)
.stat.approxQuantile(Array("singles", "doubles"), Array(q1, q2), epsilons.head)
}
val res1 = df.selectExpr("*").limit(0)
.stat.approxQuantile("singles", Array(q1, q2), epsilons.head)
assert(res1 === null)

val res2 = df.selectExpr("*").limit(0)
.stat.approxQuantile(Array("singles", "doubles"), Array(q1, q2), epsilons.head)
assert(res2 === null)
}

test("approximate quantile 2: test relativeError greater than 1 return the same result as 1") {
Expand Down