@AngersZhuuuu
Contributor

What changes were proposed in this pull request?

For the query

select array_union(array(cast('nan' as double), cast('nan' as double)), array())

This returns [NaN, NaN], but it should return [NaN].
The issue arises because OpenHashSet cannot handle Double.NaN and Float.NaN either.
This PR adds a wrapper around OpenHashSet that handles null, Double.NaN, and Float.NaN together.
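
To illustrate why a set that compares keys with primitive equality ends up with duplicated NaN entries, here is a minimal Scala sketch. It does not use Spark's OpenHashSet: `NaNEqualityDemo` and `naiveDistinct` are hypothetical names, and the linear scan merely stands in for any lookup that compares keys with `==`.

```scala
import scala.collection.mutable

object NaNEqualityDemo {
  // Deduplicate with primitive `==`, the comparison a Double-specialized set would use.
  def naiveDistinct(values: Array[Double]): Array[Double] = {
    val out = mutable.ArrayBuffer.empty[Double]
    var i = 0
    while (i < values.length) {
      val v = values(i)
      var found = false
      var j = 0
      while (j < out.length && !found) {
        // NaN == NaN is false under IEEE 754, so a NaN is never "found".
        if (out(j) == v) found = true
        j += 1
      }
      if (!found) out += v
      i += 1
    }
    out.toArray
  }

  def main(args: Array[String]): Unit = {
    val nan = Double.NaN
    println(nan == nan)                                      // false
    println(java.lang.Double.isNaN(nan))                     // true
    println(naiveDistinct(Array(nan, nan, 1.0, 1.0)).length) // 3: both NaNs survive
  }
}
```

Because equality never matches, NaN has to be detected explicitly (for example with `java.lang.Double.isNaN`) before the lookup.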

Why are the changes needed?

Fix bug

Does this PR introduce any user-facing change?

ArrayUnion no longer returns duplicated NaN values.

How was this patch tested?

Added UT

github-actions bot added the SQL label Sep 10, 2021
@AngersZhuuuu
Contributor Author

ping @cloud-fan

SparkQA commented Sep 10, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47646/


private var containsNull = false
private var containsDoubleNaN = false
private var containsFloatNaN = false
Contributor

The data added to this set will always be the same data type. I think we can just have a single containsNaN flag.

Contributor Author

> The data added to this set will always be the same data type. I think we can just have a single containsNaN flag.

I thought about this too, but since the set can support any element type, might it be better to keep the separate flags?

Contributor

Maybe we should do the null/NaN check on the caller side:

class SQLOpenHashSet ... {
  def add(k: T)
  def addNull()
  def addNaN()
}

// caller side
if (row.isNullAt...) {
  set.addNull()
} else {
  ...
  if (java.lang.Double.isNaN(value)) {
    set.addNaN()
  }
}
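
One way to read the suggestion above is the following self-contained sketch. It is not the PR's implementation: `NaNAwareSet` and `unionDistinct` are hypothetical names, `scala.collection.mutable.HashSet` stands in for OpenHashSet, and a `null` entry in an `Array[java.lang.Double]` models a SQL NULL element. It follows the single-flag idea from the first review comment, since every element in one set has the same type.

```scala
import scala.collection.mutable

// Hypothetical wrapper: null and NaN are tracked by flags and never reach the underlying set.
final class NaNAwareSet {
  private val underlying = mutable.HashSet.empty[Double] // stand-in for OpenHashSet (assumption)
  private var hasNull = false
  private var hasNaN = false

  def add(k: Double): Unit = { underlying += k }
  def contains(k: Double): Boolean = underlying.contains(k)
  def addNull(): Unit = { hasNull = true }
  def addNaN(): Unit = { hasNaN = true }
  def containsNull: Boolean = hasNull
  def containsNaN: Boolean = hasNaN
}

object NaNAwareSetDemo {
  // Caller-side checks in the suggested order: null first, then NaN, then ordinary values.
  def unionDistinct(a: Array[java.lang.Double], b: Array[java.lang.Double]): List[Any] = {
    val set = new NaNAwareSet
    val out = mutable.ArrayBuffer.empty[Any]
    (a ++ b).foreach { boxed =>
      if (boxed == null) {
        if (!set.containsNull) { set.addNull(); out += null }
      } else {
        val v = boxed.doubleValue()
        if (java.lang.Double.isNaN(v)) {
          // Emit NaN at most once, regardless of how many inputs are NaN.
          if (!set.containsNaN) { set.addNaN(); out += Double.NaN }
        } else if (!set.contains(v)) {
          set.add(v); out += v
        }
      }
    }
    out.toList
  }

  def main(args: Array[String]): Unit = {
    val nan: java.lang.Double = Double.NaN
    println(unionDistinct(Array(nan, nan), Array.empty[java.lang.Double]))
  }
}
```

Keeping the checks on the caller side means the underlying set only ever sees ordinary values, so its equality semantics never have to deal with null or NaN.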

Contributor Author

How about the current version?

SparkQA commented Sep 10, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47656/

SparkQA commented Sep 10, 2021

Test build #143142 has finished for PR 33955 at commit 8579c97.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class SQLOpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](

SparkQA commented Sep 10, 2021

Test build #143152 has finished for PR 33955 at commit 1857988.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Sep 11, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47669/

SparkQA commented Sep 11, 2021

Test build #143165 has finished for PR 33955 at commit 4e533fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Contributor Author

ping @cloud-fan

SparkQA commented Sep 13, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47685/

SparkQA commented Sep 13, 2021

Test build #143182 has finished for PR 33955 at commit 119679c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Sep 13, 2021

Test build #143186 has finished for PR 33955 at commit 45d1fee.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Sep 13, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47688/

SparkQA commented Sep 13, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47690/

@HyukjinKwon
Member

Thanks, guys. This is possibly flaky. I'll keep an eye on the build.

@HyukjinKwon
Member

I still see this test failure, see https://github.com/apache/spark/runs/3628995384. Shall we revert this PR?

@HyukjinKwon
Member

Actually there are some more: https://github.com/apache/spark/runs/3619357249

@cloud-fan
Contributor

This is so weird. There is no randomness in the test. How frequently do we see the test failure?

HyukjinKwon pushed a commit that referenced this pull request Sep 17, 2021
…lder in array functions

### What changes were proposed in this pull request?

In array functions, we use constant 0 as the placeholder when adding a null value to an array buffer. This PR makes sure the constant 0 matches the type of the array element.
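
A rough sketch of the idea, assuming the placeholder is emitted as a literal in generated Java code. The `ElemType` hierarchy and `nullPlaceholder` are hypothetical names for illustration and do not mirror Spark's actual classes or the code changed by this commit.

```scala
object PlaceholderDemo {
  // Hypothetical stand-ins for array element types; not Spark's DataType classes.
  sealed trait ElemType
  case object ByteElem extends ElemType
  case object ShortElem extends ElemType
  case object IntElem extends ElemType
  case object LongElem extends ElemType
  case object FloatElem extends ElemType
  case object DoubleElem extends ElemType

  // Java literal for a zero whose type matches the element type,
  // instead of a bare `0` (which is an int literal in Java).
  def nullPlaceholder(t: ElemType): String = t match {
    case ByteElem   => "(byte) 0"
    case ShortElem  => "(short) 0"
    case IntElem    => "0"
    case LongElem   => "0L"
    case FloatElem  => "0.0f"
    case DoubleElem => "0.0"
  }

  def main(args: Array[String]): Unit = {
    val all = Seq[ElemType](ByteElem, ShortElem, IntElem, LongElem, FloatElem, DoubleElem)
    all.foreach(t => println(s"$t -> ${nullPlaceholder(t)}"))
  }
}
```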

### Why are the changes needed?

Fix a potential bug. For reasons that are still unclear, we can sometimes hit this bug after #33955.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #34029 from cloud-fan/minor.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
cloud-fan pushed a commit that referenced this pull request Sep 17, 2021
…at.Nan

### What changes were proposed in this pull request?
For the query
```
select array_distinct(array(cast('nan' as double), cast('nan' as double)))
```
This returns [NaN, NaN], but it should return [NaN].
The issue arises because `OpenHashSet` cannot handle `Double.NaN` and `Float.NaN` either.
This PR fixes it, building on #33955.

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
ArrayDistinct no longer returns duplicated `NaN` values.

### How was this patch tested?
Added UT

Closes #33993 from AngersZhuuuu/SPARK-36741.

Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Sep 20, 2021
…oat.NaN

### What changes were proposed in this pull request?
For the query
```
select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN], but it should return [].
The issue arises because `OpenHashSet` cannot handle `Double.NaN` and `Float.NaN` either.
This PR fixes it, building on #33955.

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
ArrayIntersect won't show equal `NaN` value

### How was this patch tested?
Added UT

Closes #33995 from AngersZhuuuu/SPARK-36754.

Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Sep 22, 2021
….NaN

### What changes were proposed in this pull request?
For the query
```
select array_except(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN, 1d], but it should return [1d].
The issue arises because `OpenHashSet` cannot handle `Double.NaN` and `Float.NaN` either.
This PR fixes it, building on #33955.

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
ArrayExcept now handles equal `NaN` values.

### How was this patch tested?
Added UT

Closes #33994 from AngersZhuuuu/SPARK-36753.

Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
### What changes were proposed in this pull request?
For the query
```
select array_union(array(cast('nan' as double), cast('nan' as double)), array())
```
This returns [NaN, NaN], but it should return [NaN].
The issue arises because `OpenHashSet` cannot handle `Double.NaN` and `Float.NaN` either.
This PR adds a wrapper around `OpenHashSet` that handles `null`, `Double.NaN`, and `Float.NaN` together.

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
ArrayUnion no longer returns duplicated `NaN` values.

### How was this patch tested?
Added UT

Closes apache#33955 from AngersZhuuuu/SPARK-36702-WrapOpenHashSet.

Lead-authored-by: Angerszhuuuu <[email protected]>
Co-authored-by: AngersZhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit f71f377)
Signed-off-by: Wenchen Fan <[email protected]>
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
…and Float.NaN

### What changes were proposed in this pull request?
Following apache#33955 (comment), use a normalized NaN.

### Why are the changes needed?
Use a normalized NaN for duplicated NaN values
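
As background, NaN is not a single bit pattern, so a deduplicated result may want to emit one canonical value. The sketch below shows one way to normalize, assuming that "normalized NaN" means replacing any NaN input with the canonical `Double.NaN` constant. `NormalizeNaNDemo` and `normalize` are hypothetical names, not taken from the commit.

```scala
object NormalizeNaNDemo {
  // Map every NaN bit pattern to the canonical Double.NaN; leave other values untouched.
  def normalize(d: Double): Double =
    if (java.lang.Double.isNaN(d)) Double.NaN else d

  def main(args: Array[String]): Unit = {
    // A NaN built from a bit pattern with a non-zero payload in the low bits.
    val oddNaN = java.lang.Double.longBitsToDouble(0x7ff8000000000001L)
    println(java.lang.Double.isNaN(oddNaN))
    // After normalization the raw bits match those of the Double.NaN constant.
    println(
      java.lang.Double.doubleToRawLongBits(normalize(oddNaN)) ==
        java.lang.Double.doubleToRawLongBits(Double.NaN))
  }
}
```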

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT

Closes apache#34003 from AngersZhuuuu/SPARK-36702-FOLLOWUP.

Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 6380859)
Signed-off-by: Wenchen Fan <[email protected]>
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
…lder in array functions

### What changes were proposed in this pull request?

In array functions, we use constant 0 as the placeholder when adding a null value to an array buffer. This PR makes sure the constant 0 matches the type of the array element.

### Why are the changes needed?

Fix a potential bug. For reasons that are still unclear, we can sometimes hit this bug after apache#33955.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes apache#34029 from cloud-fan/minor.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 4145498)
Signed-off-by: Hyukjin Kwon <[email protected]>
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
…at.Nan

### What changes were proposed in this pull request?
For the query
```
select array_distinct(array(cast('nan' as double), cast('nan' as double)))
```
This returns [NaN, NaN], but it should return [NaN].
The issue arises because `OpenHashSet` cannot handle `Double.NaN` and `Float.NaN` either.
This PR fixes it, building on apache#33955.

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
ArrayDistinct no longer returns duplicated `NaN` values.

### How was this patch tested?
Added UT

Closes apache#33993 from AngersZhuuuu/SPARK-36741.

Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit e356f6a)
Signed-off-by: Wenchen Fan <[email protected]>
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
…oat.NaN

### What changes were proposed in this pull request?
For the query
```
select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN], but it should return [].
The issue arises because `OpenHashSet` cannot handle `Double.NaN` and `Float.NaN` either.
This PR fixes it, building on apache#33955.

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
ArrayIntersect won't show equal `NaN` value

### How was this patch tested?
Added UT

Closes apache#33995 from AngersZhuuuu/SPARK-36754.

Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 2fc7f2f)
Signed-off-by: Wenchen Fan <[email protected]>
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
….NaN

### What changes were proposed in this pull request?
For the query
```
select array_except(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN, 1d], but it should return [1d].
The issue arises because `OpenHashSet` cannot handle `Double.NaN` and `Float.NaN` either.
This PR fixes it, building on apache#33955.

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
ArrayExcept now handles equal `NaN` values.

### How was this patch tested?
Added UT

Closes apache#33994 from AngersZhuuuu/SPARK-36753.

Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit a7cbe69)
Signed-off-by: Wenchen Fan <[email protected]>