[SPARK-24676][SQL] Project required data from CSV parsed data when column pruning disabled #21657

maropu · 2018-06-28T07:50:06Z

What changes were proposed in this pull request?

This pr modified code to project required data from CSV parsed data when column pruning disabled.
In the current master, an exception below happens if spark.sql.csv.parser.columnPruning.enabled is false. This is because required formats and CSV parsed formats are different from each other;

./bin/spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=false
scala> val dir = "/tmp/spark-csv/csv"
scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir)
scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
18/06/25 13:48:46 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer
        at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
        at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:41)
        ...

How was this patch tested?

Added tests in CSVSuite.

maropu · 2018-06-28T07:50:27Z

@HyukjinKwon @MaxGekk

SparkQA · 2018-06-28T11:34:21Z

Test build #92418 has finished for PR 21657 at commit ecd3c80.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk

sgtm

MaxGekk · 2018-06-28T14:30:27Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

Can you use something else instead of avg? I would eliminate equality checking for floating point operands. You can easily get 4.49999999999 instead of 4.5.

MaxGekk · 2018-06-28T14:32:24Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala

It seems the bug is not related to the column pruning feature, and most likely it presented in previous versions. Should it be ported to the 2.3 branch?

IIUC, before the column pruning pr, UnivocityParser always required columns only. So, we don't need to project them in CSVFileFormat. cc: @HyukjinKwon

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala

Line 50 in e3de6ab

private val row = new GenericInternalRow(requiredSchema.length)

I think we better just merge this into the master since we already added some changes related with column pruning stuff. Let me double check it before merging it in.

MaxGekk · 2018-06-28T14:36:01Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

The string exprs and selectExpr can be replaced by .select(sum('p), avg('c0))

I think both is ok. We have preferred one?

I think both are fine~

MaxGekk · 2018-06-28T14:36:29Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

"true" -> true ?

MaxGekk · 2018-06-28T14:37:08Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

Just in case, if required schema is empty, the fix works too?

Sure, I added the test: https://github.com/apache/spark/pull/21657/files#diff-219ac8201e443435499123f96e94d29fR1592

SparkQA · 2018-06-29T05:26:00Z

Test build #92445 has finished for PR 21657 at commit 4f192a9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2018-06-29T20:30:01Z

Actually the issue happened because I removed the mapping: 64fad0b#diff-d19881aceddcaa5c60620fdcda99b4c4L79

I would propose to revert it back, and remove all those "expensive" (comparing to look up tokenIndexArr) maps and the projection: https://github.com/apache/spark/pull/21657/files#diff-a549ac2e19ee7486911e2e6403444d9dR156

maropu · 2018-07-02T07:38:21Z

@MaxGekk Do you mean we remove the option for column pruning in csv?

MaxGekk · 2018-07-02T09:18:47Z

Do you mean we remove the option for column pruning in csv?

I mean reverting back the index mapping - tokenIndexArr. In this case, your changes in buildReader are not needed.

If spark.sql.csv.parser.columnPruning.enabledis set to false, the tokenIndexArr will map data index to required index. In the default case (spark.sql.csv.parser.columnPruning.enabled is true), the array is actually not needed, and it can be replaced by the identity function.

maropu · 2018-07-02T11:46:03Z

Not sure though, the tokenIndexArr implementation is always faster than the unsafe projection?

MaxGekk

Taking into account that spark.sql.csv.parser.columnPruning.enabled is set to true by default. I think performance of another path is not so matter. For me both solutions are ok.

maropu · 2018-07-02T12:36:47Z

@HyukjinKwon WDYT?

maropu · 2018-07-03T13:23:26Z

@HyukjinKwon kindly ping

HyukjinKwon · 2018-07-03T16:20:40Z

Eh .. actually can we revive 64fad0b#diff-d19881aceddcaa5c60620fdcda99b4c4L79 ? This sounds safer to me.

maropu · 2018-07-03T23:49:32Z

ok, will revive the code here

SparkQA · 2018-07-04T06:10:56Z

Test build #92590 has finished for PR 21657 at commit fc2108e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-07-04T12:36:21Z

@HyukjinKwon @MaxGekk plz check?

MaxGekk · 2018-07-04T20:22:01Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala

It seems the code here and in the line https://github.com/apache/spark/pull/21657/files#diff-d19881aceddcaa5c60620fdcda99b4c4R51 above is the same. Could you fold the lines.

MaxGekk · 2018-07-04T20:22:50Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala

I would apply this small optimization: java.lang.Integer.valueOf(dataSchema.indexOf(f))

MaxGekk · 2018-07-04T21:56:52Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala

The lazy for tokenIndexArr means that internal laziness flag will be check per each token. I would remove lazy for tokenIndexArr

SparkQA · 2018-07-05T03:24:09Z

Test build #92629 has finished for PR 21657 at commit af620cb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2018-07-05T09:41:10Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala

Could you convert it to Array explicitly. I have checked type of tokenIndexArr, it is actually scala.collection.immutable.$colon$colon - lazy list.

Any side-effect?

You have O(n) instead of O(1) for getting a value from the collection by an index.

ah, I see. I'll recheck. Thanks!

maropu · 2018-07-06T06:36:57Z

cc: @HyukjinKwon

SparkQA · 2018-07-07T07:05:01Z

Test build #92703 has finished for PR 21657 at commit d5921f0.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-07-07T07:10:20Z

retest this please

SparkQA · 2018-07-07T10:20:17Z

Test build #92704 has finished for PR 21657 at commit d5921f0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-07-07T10:37:01Z

retest this please

SparkQA · 2018-07-07T14:13:50Z

Test build #92706 has finished for PR 21657 at commit d5921f0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-07-09T15:00:22Z

@MaxGekk, does this look good to you in general?

MaxGekk · 2018-07-09T15:10:36Z

@HyukjinKwon yes

MaxGekk · 2018-07-09T15:21:30Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala


+  // This index is used to reorder parsed tokens
+  private val tokenIndexArr =
+    requiredSchema.map(f => java.lang.Integer.valueOf(dataSchema.indexOf(f))).toArray


Just in case, we can do an optimization by memory here. The array is used under the flag options.columnPruning only. We can create an empty array (or null) if options.columnPruning is set to false.

This array is used in both cases: line 56 ( options.columnPruning=true) and line 208 (options.columnPruning=false)?

gatorsmile · 2018-07-09T22:54:54Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala

    val options: CSVOptions) extends Logging {
  require(requiredSchema.toSet.subsetOf(dataSchema.toSet),
-    "requiredSchema should be the subset of schema.")
+    "requiredSchema should be the subset of dataSchema.")


Nit: generally, we should consider printing out the schemas.

SparkQA · 2018-07-11T04:41:30Z

Test build #92834 has finished for PR 21657 at commit dd5bb59.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-07-11T05:03:33Z

@HyukjinKwon ping

MaxGekk · 2018-07-11T07:02:52Z

@HyukjinKwon @gatorsmile Would you mind to merge the PR?

HyukjinKwon · 2018-07-11T07:05:18Z

Please let me do one pass within few days.

gatorsmile · 2018-07-14T23:23:08Z

ok to test

gatorsmile · 2018-07-14T23:23:16Z

retest this please

gatorsmile

Thanks for finding and fixing this issue!

LGTM except a few comments.

gatorsmile · 2018-07-14T23:40:58Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala


 class UnivocityParser(
    dataSchema: StructType,
    requiredSchema: StructType,


Could you add the parameter descriptions of dataSchema and requiredSchema above class UnivocityParser?

gatorsmile · 2018-07-14T23:56:53Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

+        val dir = path.getAbsolutePath
+        spark.range(10).selectExpr("id % 2 AS p", "id AS c0", "id AS c1").write.partitionBy("p")
+          .option("header", "true").csv(dir)
+        var df = spark.read.option("header", true).csv(dir).selectExpr("sum(p)", "count(c0)")


Normally, we do not use var for DataFrame even in test cases

gatorsmile · 2018-07-15T00:15:37Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala

+
  val tokenizer = {
    val parserSetting = options.asParserSettings
    if (options.columnPruning && requiredSchema.length < dataSchema.length) {


Can be simplified to

// When to-be-parsed schema is shorter than the to-be-read data schema, we let Univocity CSV parser select a sequence of fields for reading by their positions.
if (parsedSchema.length < dataSchema.length)

gatorsmile · 2018-07-15T00:18:20Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala

-  private val schema = if (options.columnPruning) requiredSchema else dataSchema

-  private val row = new GenericInternalRow(schema.length)
+  private val parsedSchema = if (options.columnPruning) requiredSchema else dataSchema


Add a comment like

// When column pruning is enabled, the parser only parses the required columns based on their positions in the data schema.

gatorsmile · 2018-07-15T00:26:51Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala

        var i = 0
-        while (i < schema.length) {
-          row(i) = valueConverters(i).apply(tokens(i))
+        while (i < requiredSchema.length) {


Add the comment like

// When the length of the returned tokens is identical to the length of the parsed schema, we just need to convert the tokens that correspond to the required columns.

gatorsmile · 2018-07-15T00:28:00Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala

+
  private def convert(tokens: Array[String]): InternalRow = {
-    if (tokens.length != schema.length) {
+    if (tokens.length != parsedSchema.length) {


If possible, could you add a test case that satisfy tokens.length != parsedSchema.length

SparkQA · 2018-07-15T03:06:11Z

Test build #93010 has finished for PR 21657 at commit dd5bb59.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon

LGTM too

SparkQA · 2018-07-15T10:53:31Z

Test build #93015 has finished for PR 21657 at commit 81b3971.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-07-16T03:21:34Z

Thanks! Merged to master

MaxGekk reviewed Jun 28, 2018

View reviewed changes

MaxGekk approved these changes Jul 2, 2018

View reviewed changes

MaxGekk reviewed Jul 4, 2018

View reviewed changes

MaxGekk reviewed Jul 5, 2018

View reviewed changes

maropu added 5 commits July 7, 2018 11:15

Fix

33671c8

Fix

83ed082

Fix

2d11e5f

Fix

12906d8

Fix

d5921f0

maropu force-pushed the SPARK-24676 branch from af620cb to d5921f0 Compare July 7, 2018 05:14

MaxGekk reviewed Jul 9, 2018

View reviewed changes

gatorsmile reviewed Jul 9, 2018

View reviewed changes

Fix

dd5bb59

gatorsmile reviewed Jul 15, 2018

View reviewed changes

Fix

6722fba

HyukjinKwon approved these changes Jul 15, 2018

View reviewed changes

Fix

81b3971

asfgit closed this in d463533 Jul 16, 2018

[SPARK-24676][SQL] Project required data from CSV parsed data when column pruning disabled #21657

[SPARK-24676][SQL] Project required data from CSV parsed data when column pruning disabled #21657

Uh oh!

Conversation

maropu commented Jun 28, 2018

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

maropu commented Jun 28, 2018

Uh oh!

SparkQA commented Jun 28, 2018

Uh oh!

MaxGekk left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Jun 29, 2018

Uh oh!

MaxGekk commented Jun 29, 2018

Uh oh!

maropu commented Jul 2, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MaxGekk commented Jul 2, 2018

Uh oh!

maropu commented Jul 2, 2018

Uh oh!

MaxGekk left a comment

Choose a reason for hiding this comment

Uh oh!

maropu commented Jul 2, 2018

Uh oh!

maropu commented Jul 3, 2018

Uh oh!

HyukjinKwon commented Jul 3, 2018

Uh oh!

maropu commented Jul 3, 2018

Uh oh!

SparkQA commented Jul 4, 2018

Uh oh!

maropu commented Jul 4, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Jul 5, 2018

Uh oh!

Choose a reason for hiding this comment

maropu commented Jul 2, 2018 •

edited

Loading

HyukjinKwon commented Jul 9, 2018 •

edited

Loading

maropu Jul 11, 2018 •

edited

Loading

gatorsmile Jul 9, 2018 •

edited

Loading