Skip to content

Conversation

@gene-db
Copy link
Contributor

@gene-db gene-db commented Apr 27, 2024

What changes were proposed in this pull request?

This fixes how ColumnVector handles copying arrays when the vector has a dictionary and null values. The possible issues with the previous implementation:

  • An ArrayIndexOutOfBoundsException may be thrown when the ColumnVector has nulls and dictionaries. This is because the dictionary id for null entries might be invalid and should not be used for null entries.
  • Copying a ColumnarArray (which contains a ColumnVector) is incorrect, if it contains null entries. This is because copying a primitive array does not take into account the null entries, so all the null entries get lost.

Why are the changes needed?

These changes are needed to avoid ArrayIndexOutOfBoundsException and to produce correct results when copying ColumnarArray.

Does this PR introduce any user-facing change?

The only user facing changes are to fix existing errors and incorrect results.

How was this patch tested?

Added new unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Apr 27, 2024
Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch!

@cloud-fan
Copy link
Contributor

thanks, merging to master/3.5!

@cloud-fan cloud-fan closed this in 76ce6b0 Apr 28, 2024
cloud-fan pushed a commit that referenced this pull request Apr 28, 2024
…th dictionary and nulls

This fixes how `ColumnVector` handles copying arrays when the vector has a dictionary and null values. The possible issues with the previous implementation:
- An `ArrayIndexOutOfBoundsException` may be thrown when the `ColumnVector` has nulls and dictionaries. This is because the dictionary id for `null` entries might be invalid and should not be used for `null` entries.
- Copying a `ColumnarArray` (which contains a `ColumnVector`) is incorrect, if it contains `null` entries. This is because copying a primitive array does not take into account the `null` entries, so all the null entries get lost.

These changes are needed to avoid `ArrayIndexOutOfBoundsException` and to produce correct results when copying `ColumnarArray`.

The only user facing changes are to fix existing errors and incorrect results.

Added new unit tests.

No.

Closes #46254 from gene-db/dictionary-nulls.

Authored-by: Gene Pang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 76ce6b0)
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request May 5, 2024
…when nulls exist

### What changes were proposed in this pull request?

This is a followup to #46254 . Instead of using object arrays when nulls are present, continue to use primitive arrays when appropriate. This PR sets the null bits appropriately for the primitive array copy.

Primitive arrays are faster than object arrays and won't create unnecessary objects.

### Why are the changes needed?

This will improve performance and memory usage, when nulls are present in the `ColumnarArray`.

### Does this PR introduce _any_ user-facing change?

This is expected to be faster when copying `ColumnarArray`.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46372 from gene-db/primitive-nulls.

Authored-by: Gene Pang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request May 5, 2024
…when nulls exist

### What changes were proposed in this pull request?

This is a followup to #46254 . Instead of using object arrays when nulls are present, continue to use primitive arrays when appropriate. This PR sets the null bits appropriately for the primitive array copy.

Primitive arrays are faster than object arrays and won't create unnecessary objects.

### Why are the changes needed?

This will improve performance and memory usage, when nulls are present in the `ColumnarArray`.

### Does this PR introduce _any_ user-facing change?

This is expected to be faster when copying `ColumnarArray`.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46372 from gene-db/primitive-nulls.

Authored-by: Gene Pang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit bf2e254)
Signed-off-by: Wenchen Fan <[email protected]>
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @gene-db and @cloud-fan .

SPARK-48019 was filed with Affected Version: 3.4.3. I also verified that the test cases fail on branch-3.4. Since this is a correctness issue, let me backport this and the follow-up to branch-3.4

Screenshot 2024-09-04 at 13 51 34

dongjoon-hyun pushed a commit that referenced this pull request Sep 4, 2024
…th dictionary and nulls

This fixes how `ColumnVector` handles copying arrays when the vector has a dictionary and null values. The possible issues with the previous implementation:
- An `ArrayIndexOutOfBoundsException` may be thrown when the `ColumnVector` has nulls and dictionaries. This is because the dictionary id for `null` entries might be invalid and should not be used for `null` entries.
- Copying a `ColumnarArray` (which contains a `ColumnVector`) is incorrect, if it contains `null` entries. This is because copying a primitive array does not take into account the `null` entries, so all the null entries get lost.

These changes are needed to avoid `ArrayIndexOutOfBoundsException` and to produce correct results when copying `ColumnarArray`.

The only user facing changes are to fix existing errors and incorrect results.

Added new unit tests.

No.

Closes #46254 from gene-db/dictionary-nulls.

Authored-by: Gene Pang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 76ce6b0)
Signed-off-by: Wenchen Fan <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Sep 4, 2024
…when nulls exist

### What changes were proposed in this pull request?

This is a followup to #46254 . Instead of using object arrays when nulls are present, continue to use primitive arrays when appropriate. This PR sets the null bits appropriately for the primitive array copy.

Primitive arrays are faster than object arrays and won't create unnecessary objects.

### Why are the changes needed?

This will improve performance and memory usage, when nulls are present in the `ColumnarArray`.

### Does this PR introduce _any_ user-facing change?

This is expected to be faster when copying `ColumnarArray`.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46372 from gene-db/primitive-nulls.

Authored-by: Gene Pang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit bf2e254)
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants