[SPARK-20978][SQL] Bump up Univocity version to 2.5.4 #19113

HyukjinKwon · 2017-09-04T01:25:00Z

What changes were proposed in this pull request?

There was a bug in the Univocity parser that caused the issue in SPARK-20978. It can be reproduced as below:

val df = spark.read.schema("a string, b string, unparsed string").option("columnNameOfCorruptRecord", "unparsed").csv(Seq("a").toDS())
df.show()

Before

java.lang.NullPointerException
	at scala.collection.immutable.StringLike$class.stripLineEnd(StringLike.scala:89)
	at scala.collection.immutable.StringOps.stripLineEnd(StringOps.scala:29)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$getCurrentInput(UnivocityParser.scala:56)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207)
...

After

+---+----+--------+
|  a|   b|unparsed|
+---+----+--------+
|  a|null|       a|
+---+----+--------+

The bug was fixed in Univocity 2.5.0, and 2.5.4 has since been released. I guess it'd be safe to upgrade to it.
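
For readers skimming the stack trace: it suggests that a null string reached stripLineEnd inside getCurrentInput. The sketch below only illustrates that failure mode with a null-safe wrapper. It is not the fix applied here (this PR only upgrades the library), and stripLineEndSafe is a hypothetical helper name that exists in neither Spark nor Univocity.

```scala
object NullSafeStrip {
  // Hypothetical helper for illustration only.
  // Option(raw) turns a null reference into None, so stripLineEnd
  // is never invoked on a null string.
  def stripLineEndSafe(raw: String): Option[String] =
    Option(raw).map(_.stripLineEnd)

  def main(args: Array[String]): Unit = {
    println(stripLineEndSafe("a\n")) // prints Some(a)
    println(stripLineEndSafe(null))  // prints None
  }
}
```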

How was this patch tested?

Unit test added in CSVSuite.scala.
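
A rough standalone version of such a check might look like the sketch below. It is not the test added to CSVSuite.scala; the object name, the method name, and the local SparkSession setup are assumptions made for illustration. The expected row is taken from the "After" output above.

```scala
import org.apache.spark.sql.{Row, SparkSession}

object CorruptRecordCheck {
  // Reads the single-token input "a" against a three-column schema,
  // routing the raw line into the corrupt-record column "unparsed".
  def readSingleToken(spark: SparkSession): Seq[Row] = {
    import spark.implicits._ // provides toDS() on Seq[String]
    spark.read
      .schema("a string, b string, unparsed string")
      .option("columnNameOfCorruptRecord", "unparsed")
      .csv(Seq("a").toDS())
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local, single-threaded session is enough for this check.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("spark-20978-check")
      .getOrCreate()
    try {
      val rows = readSingleToken(spark)
      // Same content as the "After" table: a, null, a.
      assert(rows == Seq(Row("a", null, "a")), s"unexpected rows: $rows")
    } finally {
      spark.stop()
    }
  }
}
```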

SparkQA · 2017-09-04T04:39:00Z

Test build #81368 has finished for PR 19113 at commit fa7eb51.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-09-04T05:03:40Z

Any performance measurements comparing 2.2 and 2.5?

gatorsmile · 2017-09-04T05:04:33Z

How about other popular open source projects? Do you know which projects are using Univocity 2.5?

HyukjinKwon · 2017-09-04T12:20:48Z

With 2.7GB of data, I ran a simple Java program against both 2.5.4 and 2.2.1 using CsvParser, plus simple e2e read tests. The elapsed-time difference was roughly -1.7% to +1.2%, so I think there is virtually no difference (or a ~0.25% improvement).

I think we generally trust the other communities and libraries we have decided to depend on, such as ORC, Parquet, and Jackson, and rely on community support rather than duplicating such efforts. I think we discussed a similar issue before.
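
A comparison of this kind could be set up roughly as in the sketch below. It is not the program used for the numbers above; CsvParseTiming, timeParse, and percentDiff are hypothetical names, and the same code would be run once per Univocity version on the classpath.

```scala
import java.io.File
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

object CsvParseTiming {
  // Parses the whole file with default settings and returns
  // (number of rows, elapsed time in nanoseconds).
  def timeParse(file: File): (Long, Long) = {
    val parser = new CsvParser(new CsvParserSettings())
    val start = System.nanoTime()
    parser.beginParsing(file)
    var rows = 0L
    // parseNext() returns null once the input is exhausted.
    while (parser.parseNext() != null) rows += 1
    (rows, System.nanoTime() - start)
  }

  // Relative difference of `candidate` versus `baseline`, in percent.
  // Negative values mean the candidate was faster.
  def percentDiff(baseline: Long, candidate: Long): Double =
    (candidate - baseline).toDouble / baseline * 100.0

  def main(args: Array[String]): Unit = {
    val (rows, nanos) = timeParse(new File(args(0)))
    println(s"rows=$rows elapsedMs=${nanos / 1000000L}")
  }
}
```

Running it several times per version and comparing the elapsed times with percentDiff gives a spread like the one quoted, though a single-pass timing like this is sensitive to JIT warm-up and disk caching.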

gatorsmile · 2017-09-05T06:48:14Z

This release of Univocity came out just a few days ago. To me, this sounds risky.

We normally do not upgrade to the latest version. This is why we are not using Parquet 1.9.0. Instead, we asked the Parquet community to release 1.8.2.

cc @rxin @marmbrus @cloud-fan

gatorsmile · 2017-09-05T07:01:31Z

Since our next version, Spark 2.3, is expected at the end of this year, we can still revert to 2.2.1 if we find that release 2.5.4 introduces new bugs or a performance regression.

I am fine with merging it now. Let @rxin @marmbrus @cloud-fan give the final confirmation.

srowen · 2017-09-05T09:02:34Z

If we need 2.5.x for the fix, then we need 2.5.x. It's worth picking up an update if it solves a real problem. And if we're going to update minor versions, it's generally good practice to pick the latest maintenance release unless there's a specific reason not to. I don't think we have any general policy against using the latest version of something; on the contrary. Parquet is more critical and perhaps less reliable about maintaining the exact behavior, so maybe deserves more caution, but this change seems fine.

cloud-fan · 2017-09-05T15:22:02Z

We didn't accept Parquet 1.9.0 because it has a known performance regression. I think this one is fine. Merging to master, thanks!

Bump up Univocity version to 2.5.4

fa7eb51

srowen approved these changes Sep 4, 2017

asfgit closed this in 02a4386 Sep 5, 2017

HyukjinKwon deleted the bump-up-univocity branch January 2, 2018 03:37
