Skip to content

Conversation

@HyukjinKwon
Copy link
Member

What changes were proposed in this pull request?

There was a bug in Univocity Parser that causes the issue in SPARK-20978. This was fixed as below:

val df = spark.read.schema("a string, b string, unparsed string").option("columnNameOfCorruptRecord", "unparsed").csv(Seq("a").toDS())
df.show()

Before

java.lang.NullPointerException
	at scala.collection.immutable.StringLike$class.stripLineEnd(StringLike.scala:89)
	at scala.collection.immutable.StringOps.stripLineEnd(StringOps.scala:29)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$getCurrentInput(UnivocityParser.scala:56)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207)
...

After

+---+----+--------+
|  a|   b|unparsed|
+---+----+--------+
|  a|null|       a|
+---+----+--------+

It was fixed in 2.5.0 and 2.5.4 was released. I guess it'd be safe to upgrade this.

How was this patch tested?

Unit test added in CSVSuite.scala.

@SparkQA
Copy link

SparkQA commented Sep 4, 2017

Test build #81368 has finished for PR 19113 at commit fa7eb51.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

Any performance measure from 2.2 to 2.5?

@gatorsmile
Copy link
Member

How about the other popular open source projects? Do you know whether which projects are using Univocity 2.5?

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Sep 4, 2017

With 2.7GB data, I ran a simple Java problem with 2.5.4 and 2.2.1 with CsvParser, and simple e2e read tests. Elapsed time diff was roughly -1.7% ~ +1.2%. I think virtually no diff (or ~0.25% improvement).

I think we generally trust other communities and libraries we decided to add such as ORC, Parquet, Jackson and etc., and de-duplicate such efforts with the community support. I think we discussed about a similar issue before.

@gatorsmile
Copy link
Member

gatorsmile commented Sep 5, 2017

This release of Univocity was just out a few days ago. To me, this sound risky.

We normally do not upgrade it to the latest version. This is why we are not using Parquet 1.9.0. Instead, we asked Parquet community to release 1.8.2.

cc @rxin @marmbrus @cloud-fan

@gatorsmile
Copy link
Member

Since the expected release of our next version Spark 2.3 is the end of this year, we still can revert it back to 2.2.1 if we realize this release 2.5.4 introduces new bugs or performance regression.

I am fine to merge it now. Let @rxin @marmbrus @cloud-fan do the final confirm.

@srowen
Copy link
Member

srowen commented Sep 5, 2017

If we need 2.5.x for the fix, then we need 2.5.x. It's worth picking up an update if it solves a real problem. And if we're going to update minor versions, it's generally good practice to pick the latest maintenance release unless there's a specific reason not to. I don't think we have any general policy against using the latest version of something; on the contrary. Parquet is more critical and perhaps less reliable about maintaining the exact behavior, so maybe deserves more caution, but this change seems fine.

@cloud-fan
Copy link
Contributor

We didn't accept parquet 1.9.0 because it has a known performance regression, I think this one is fine, merging to master, thanks!

@asfgit asfgit closed this in 02a4386 Sep 5, 2017
@HyukjinKwon HyukjinKwon deleted the bump-up-univocity branch January 2, 2018 03:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants