-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan #33955
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan #33955
Conversation
|
ping @cloud-fan |
|
Kubernetes integration test unable to build dist. exiting with code: 1 |
sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
Outdated
Show resolved
Hide resolved
|
|
||
| private var containsNull = false | ||
| private var containsDoubleNaN = false | ||
| private var containsFloatNaN = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The data added to this set will always be the same data type. I think we can just have a single containsNaN flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The data added to this set will always be the same data type. I think we can just have a single
containsNaNflag.
I have thought about this too, but since it can support any type, so keep this may be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should do the null/nan check at the caller side
class SQLOpenHashSet ... {
def add(k: T)
def addNull()
def addNaN()
}
// caller side
if (row.isNullAt...) {
set.addNull()
} else {
...
if (java.lang.Double.isNaN(value)) {
set.addNaN()
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about current
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
Outdated
Show resolved
Hide resolved
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
Outdated
Show resolved
Hide resolved
|
Kubernetes integration test unable to build dist. exiting with code: 1 |
sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
Outdated
Show resolved
Hide resolved
|
Test build #143142 has finished for PR 33955 at commit
|
|
Test build #143152 has finished for PR 33955 at commit
|
|
Kubernetes integration test unable to build dist. exiting with code: 1 |
|
Test build #143165 has finished for PR 33955 at commit
|
|
ping @cloud-fan |
|
Kubernetes integration test unable to build dist. exiting with code: 1 |
|
Test build #143182 has finished for PR 33955 at commit
|
|
Test build #143186 has finished for PR 33955 at commit
|
|
Kubernetes integration test unable to build dist. exiting with code: 1 |
…rsZhuuuu/spark into SPARK-36702-WrapOpenHashSet
|
Kubernetes integration test unable to build dist. exiting with code: 1 |
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
Outdated
Show resolved
Hide resolved
|
Thanks guys.This is possibly flaky. I'll keep my eyes on the build. |
|
I still see this test failure, see https://github.com/apache/spark/runs/3628995384. Shall we revert this PR? |
|
Actually there are some more: https://github.com/apache/spark/runs/3619357249 |
|
This is so weird. There is no randomness in the test. How frequently do we see the test failure? |
…lder in array functions ### What changes were proposed in this pull request? In array functions, we use constant 0 as the placeholder when adding a null value to an array buffer. This PR makes sure the constant 0 matches the type of the array element. ### Why are the changes needed? Fix a potential bug. Somehow we can hit this bug sometimes after #33955 . ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes #34029 from cloud-fan/minor. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
…lder in array functions ### What changes were proposed in this pull request? In array functions, we use constant 0 as the placeholder when adding a null value to an array buffer. This PR makes sure the constant 0 matches the type of the array element. ### Why are the changes needed? Fix a potential bug. Somehow we can hit this bug sometimes after #33955 . ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes #34029 from cloud-fan/minor. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit 4145498) Signed-off-by: Hyukjin Kwon <[email protected]>
…lder in array functions ### What changes were proposed in this pull request? In array functions, we use constant 0 as the placeholder when adding a null value to an array buffer. This PR makes sure the constant 0 matches the type of the array element. ### Why are the changes needed? Fix a potential bug. Somehow we can hit this bug sometimes after #33955 . ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes #34029 from cloud-fan/minor. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit 4145498) Signed-off-by: Hyukjin Kwon <[email protected]>
…lder in array functions In array functions, we use constant 0 as the placeholder when adding a null value to an array buffer. This PR makes sure the constant 0 matches the type of the array element. Fix a potential bug. Somehow we can hit this bug sometimes after #33955 . No existing tests Closes #34029 from cloud-fan/minor. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit 4145498) Signed-off-by: Hyukjin Kwon <[email protected]>
…at.Nan
### What changes were proposed in this pull request?
For query
```
select array_distinct(array(cast('nan' as double), cast('nan' as double)))
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayDistinct won't show duplicated `NaN` value
### How was this patch tested?
Added UT
Closes #33993 from AngersZhuuuu/SPARK-36741.
Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…at.Nan
### What changes were proposed in this pull request?
For query
```
select array_distinct(array(cast('nan' as double), cast('nan' as double)))
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayDistinct won't show duplicated `NaN` value
### How was this patch tested?
Added UT
Closes #33993 from AngersZhuuuu/SPARK-36741.
Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit e356f6a)
Signed-off-by: Wenchen Fan <[email protected]>
…at.Nan
### What changes were proposed in this pull request?
For query
```
select array_distinct(array(cast('nan' as double), cast('nan' as double)))
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayDistinct won't show duplicated `NaN` value
### How was this patch tested?
Added UT
Closes #33993 from AngersZhuuuu/SPARK-36741.
Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit e356f6a)
Signed-off-by: Wenchen Fan <[email protected]>
…at.Nan
For query
```
select array_distinct(array(cast('nan' as double), cast('nan' as double)))
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955
Fix bug
ArrayDistinct won't show duplicated `NaN` value
Added UT
Closes #33993 from AngersZhuuuu/SPARK-36741.
Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit e356f6a)
Signed-off-by: Wenchen Fan <[email protected]>
…oat.NaN
### What changes were proposed in this pull request?
For query
```
select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN], but it should return [].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayIntersect won't show equal `NaN` value
### How was this patch tested?
Added UT
Closes #33995 from AngersZhuuuu/SPARK-36754.
Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…oat.NaN
### What changes were proposed in this pull request?
For query
```
select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN], but it should return [].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayIntersect won't show equal `NaN` value
### How was this patch tested?
Added UT
Closes #33995 from AngersZhuuuu/SPARK-36754.
Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 2fc7f2f)
Signed-off-by: Wenchen Fan <[email protected]>
…oat.NaN
### What changes were proposed in this pull request?
For query
```
select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN], but it should return [].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayIntersect won't show equal `NaN` value
### How was this patch tested?
Added UT
Closes #33995 from AngersZhuuuu/SPARK-36754.
Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 2fc7f2f)
Signed-off-by: Wenchen Fan <[email protected]>
…oat.NaN
### What changes were proposed in this pull request?
For query
```
select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN], but it should return [].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayIntersect won't show equal `NaN` value
### How was this patch tested?
Added UT
Closes #33995 from AngersZhuuuu/SPARK-36754.
Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 2fc7f2f)
Signed-off-by: Wenchen Fan <[email protected]>
….NaN
### What changes were proposed in this pull request?
For query
```
select array_except(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN, 1d], but it should return [1d].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayExcept won't show handle equal `NaN` value
### How was this patch tested?
Added UT
Closes #33994 from AngersZhuuuu/SPARK-36753.
Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
….NaN
### What changes were proposed in this pull request?
For query
```
select array_except(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN, 1d], but it should return [1d].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayExcept won't show handle equal `NaN` value
### How was this patch tested?
Added UT
Closes #33994 from AngersZhuuuu/SPARK-36753.
Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit a7cbe69)
Signed-off-by: Wenchen Fan <[email protected]>
….NaN
### What changes were proposed in this pull request?
For query
```
select array_except(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN, 1d], but it should return [1d].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayExcept won't show handle equal `NaN` value
### How was this patch tested?
Added UT
Closes #33994 from AngersZhuuuu/SPARK-36753.
Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit a7cbe69)
Signed-off-by: Wenchen Fan <[email protected]>
….NaN
### What changes were proposed in this pull request?
For query
```
select array_except(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN, 1d], but it should return [1d].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayExcept won't show handle equal `NaN` value
### How was this patch tested?
Added UT
Closes #33994 from AngersZhuuuu/SPARK-36753.
Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit a7cbe69)
Signed-off-by: Wenchen Fan <[email protected]>
….NaN
### What changes were proposed in this pull request?
For query
```
select array_except(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN, 1d], but it should return [1d].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on apache/spark#33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayExcept won't show handle equal `NaN` value
### How was this patch tested?
Added UT
Closes #33994 from AngersZhuuuu/SPARK-36753.
Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
For query
```
select array_union(array(cast('nan' as double), cast('nan' as double)), array())
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr we add a wrap for OpenHashSet that can handle `null`, `Double.NaN`, `Float.NaN` together
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayUnion won't show duplicated `NaN` value
### How was this patch tested?
Added UT
Closes apache#33955 from AngersZhuuuu/SPARK-36702-WrapOpenHashSet.
Lead-authored-by: Angerszhuuuu <[email protected]>
Co-authored-by: AngersZhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit f71f377)
Signed-off-by: Wenchen Fan <[email protected]>
…and Float.NaN ### What changes were proposed in this pull request? According to apache#33955 (comment) use normalized NaN ### Why are the changes needed? Use normalized NaN for duplicated NaN value ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Exiting UT Closes apache#34003 from AngersZhuuuu/SPARK-36702-FOLLOWUP. Authored-by: Angerszhuuuu <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 6380859) Signed-off-by: Wenchen Fan <[email protected]>
…lder in array functions ### What changes were proposed in this pull request? In array functions, we use constant 0 as the placeholder when adding a null value to an array buffer. This PR makes sure the constant 0 matches the type of the array element. ### Why are the changes needed? Fix a potential bug. Somehow we can hit this bug sometimes after apache#33955 . ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes apache#34029 from cloud-fan/minor. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit 4145498) Signed-off-by: Hyukjin Kwon <[email protected]>
…at.Nan
### What changes were proposed in this pull request?
For query
```
select array_distinct(array(cast('nan' as double), cast('nan' as double)))
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on apache#33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayDistinct won't show duplicated `NaN` value
### How was this patch tested?
Added UT
Closes apache#33993 from AngersZhuuuu/SPARK-36741.
Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit e356f6a)
Signed-off-by: Wenchen Fan <[email protected]>
…oat.NaN
### What changes were proposed in this pull request?
For query
```
select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN], but it should return [].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on apache#33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayIntersect won't show equal `NaN` value
### How was this patch tested?
Added UT
Closes apache#33995 from AngersZhuuuu/SPARK-36754.
Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 2fc7f2f)
Signed-off-by: Wenchen Fan <[email protected]>
….NaN
### What changes were proposed in this pull request?
For query
```
select array_except(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN, 1d], but it should return [1d].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on apache#33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayExcept won't show handle equal `NaN` value
### How was this patch tested?
Added UT
Closes apache#33994 from AngersZhuuuu/SPARK-36753.
Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit a7cbe69)
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
For query
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by
OpenHashSetcan't handleDouble.NaNandFloat.NaNtoo.In this pr we add a wrap for OpenHashSet that can handle
null,Double.NaN,Float.NaNtogetherWhy are the changes needed?
Fix bug
Does this PR introduce any user-facing change?
ArrayUnion won't show duplicated
NaNvalueHow was this patch tested?
Added UT