Skip to content

Conversation

@gcmerz
Copy link

@gcmerz gcmerz commented May 27, 2019

What changes were proposed in this pull request?

This PR aims to fix an issue on a union avro type with more than one non-null value (for instance ["string", "null", "int"]) whose the deserialization to a DataFrame would throw a java.lang.ArrayIndexOutOfBoundsException. The issue was that the fieldWriter relied on the index from the avro schema before nulls were filtered out.

How was this patch tested?

A test for the case of multiple non-null values was added and the tests were run using sbt by running testOnly org.apache.spark.sql.avro.AvroSuite

@gcmerz gcmerz changed the title Fix for avro deserialization on union types with multiple non-null types [External/Avro] Fix for avro deserialization on union types with multiple non-null types May 27, 2019
@dongjoon-hyun
Copy link
Member

Thank you for your first contribution, @gcmerz !

@dongjoon-hyun
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented May 28, 2019

Test build #105850 has finished for PR 24722 at commit ee97960.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [External/Avro] Fix for avro deserialization on union types with multiple non-null types [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types May 28, 2019
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left two minor comments on the test case. Could you update them? Also, I updated the PR title and description a little bit.

@dongjoon-hyun
Copy link
Member

cc @dbtsai and @gengliangwang

@gcmerz
Copy link
Author

gcmerz commented May 28, 2019

Applied the tweaks--thank you so much for the quick review!

@dongjoon-hyun
Copy link
Member

You're welcome. Thank you for swift update.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

@SparkQA
Copy link

SparkQA commented May 28, 2019

Test build #105855 has finished for PR 24722 at commit 4edbe09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Merged to master and branch-2.4.

dongjoon-hyun pushed a commit that referenced this pull request May 28, 2019
…ultiple non-null types

## What changes were proposed in this pull request?

This PR aims to fix an issue on a union avro type with more than one non-null value (for instance `["string", "null", "int"]`) whose the deserialization to a DataFrame would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue was that the `fieldWriter` relied on the index from the avro schema before nulls were filtered out.

## How was this patch tested?

A test for the case of multiple non-null values was added and the tests were run using sbt by running `testOnly org.apache.spark.sql.avro.AvroSuite`

Closes #24722 from gcmerz/master.

Authored-by: Gabbi Merz <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 29e154b)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented May 28, 2019

@gcmerz . What is your id in Apache JIRA? If you don't have, please create one. Then, I can assign that issue to you.

And, FYI, in GitHub personal setting, you can additionally add your Palantir email (used in this PR). Then, your commit with Palantir ID also will show your GitHub profile.

@gengliangwang
Copy link
Member

A late LGTM! Thanks for the fix @gcmerz

@gcmerz
Copy link
Author

gcmerz commented May 28, 2019

@dongjoon-hyun Made a Jira account with username "gcmerz" and added palantir email. Thanks again!

@dongjoon-hyun
Copy link
Member

It's assigned to you, @gcmerz . Welcome to the Apache Spark community!

@dongjoon-hyun
Copy link
Member

Thank you for review, @gengliangwang .

vinooganesh pushed a commit to palantir/spark that referenced this pull request May 28, 2019
…ultiple non-null types

## What changes were proposed in this pull request?

This PR aims to fix an issue on a union avro type with more than one non-null value (for instance `["string", "null", "int"]`) whose the deserialization to a DataFrame would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue was that the `fieldWriter` relied on the index from the avro schema before nulls were filtered out.

## How was this patch tested?

A test for the case of multiple non-null values was added and the tests were run using sbt by running `testOnly org.apache.spark.sql.avro.AvroSuite`

Closes apache#24722 from gcmerz/master.

Authored-by: Gabbi Merz <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
rluta pushed a commit to rluta/spark that referenced this pull request Sep 17, 2019
…ultiple non-null types

## What changes were proposed in this pull request?

This PR aims to fix an issue on a union avro type with more than one non-null value (for instance `["string", "null", "int"]`) whose the deserialization to a DataFrame would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue was that the `fieldWriter` relied on the index from the avro schema before nulls were filtered out.

## How was this patch tested?

A test for the case of multiple non-null values was added and the tests were run using sbt by running `testOnly org.apache.spark.sql.avro.AvroSuite`

Closes apache#24722 from gcmerz/master.

Authored-by: Gabbi Merz <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 29e154b)
Signed-off-by: Dongjoon Hyun <[email protected]>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Sep 26, 2019
…ultiple non-null types

## What changes were proposed in this pull request?

This PR aims to fix an issue on a union avro type with more than one non-null value (for instance `["string", "null", "int"]`) whose the deserialization to a DataFrame would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue was that the `fieldWriter` relied on the index from the avro schema before nulls were filtered out.

## How was this patch tested?

A test for the case of multiple non-null values was added and the tests were run using sbt by running `testOnly org.apache.spark.sql.avro.AvroSuite`

Closes apache#24722 from gcmerz/master.

Authored-by: Gabbi Merz <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 29e154b)
Signed-off-by: Dongjoon Hyun <[email protected]>
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
…ultiple non-null types

## What changes were proposed in this pull request?

This PR aims to fix an issue on a union avro type with more than one non-null value (for instance `["string", "null", "int"]`) whose the deserialization to a DataFrame would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue was that the `fieldWriter` relied on the index from the avro schema before nulls were filtered out.

## How was this patch tested?

A test for the case of multiple non-null values was added and the tests were run using sbt by running `testOnly org.apache.spark.sql.avro.AvroSuite`

Closes apache#24722 from gcmerz/master.

Authored-by: Gabbi Merz <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

(cherry picked from commit 29e154b)

RB=2633258
G=spark-reviewers
R=ekrogen,wyzhang,wmoustaf
A=ekrogen
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants