@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Jan 9, 2018

What changes were proposed in this pull request?

The reader schema is said to be evolved (or projected) when it has changed after the data was written. The following changes are already supported by file-based data sources. Note that partition columns are not stored in the data files. In this PR, "column" means a non-partition column.

  1. Add a column
  2. Hide a column
  3. Change a column position
  4. Change a column type (upcast)

This issue aims to give users guaranteed, backward-compatible read-schema test coverage for file-based data sources and to prevent future regressions by adding explicit read-schema tests.

Here, we consider only safe changes that cause no data loss. For example, a data type change should go from a smaller type to a larger type, such as int to long, and not vice versa.

As of today, in the master branch, file-based data sources have the following coverage.

| File Format | Coverage | Note |
| --- | --- | --- |
| TEXT | N/A | Schema consists of a single string column. |
| CSV | 1, 2, 4 | |
| JSON | 1, 2, 3, 4 | |
| ORC | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage among ORC formats. |
| PARQUET | 1, 2, 3 | |
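
As a rough illustration of case 4 (upcast), the sketch below writes an `int` column as JSON and reads it back with a user-specified schema that declares the column as `long`. The object name `ReadSchemaUpcastSketch`, the helper `readAsLong`, the local master, and the temporary path are assumptions made for this example and are not part of the PR. It also assumes `spark-sql` is on the classpath.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

// Hypothetical stand-alone sketch; the PR itself relies on Spark's test harness instead.
object ReadSchemaUpcastSketch {
  // Writes an int column as JSON, then reads it back with a wider (long) read schema.
  def readAsLong(spark: SparkSession): Seq[Long] = {
    import spark.implicits._

    // Scratch location for this example only; Spark creates the "data" directory on write.
    val path = Files.createTempDirectory("read-schema").resolve("data").toString

    Seq(1, 2, 3).toDF("col1").write.format("json").save(path)

    // The read schema is supplied by the user, so nothing is inferred from the files.
    val df = spark.read.schema("col1 long").format("json").load(path)
    df.collect().map(_.getLong(0)).sorted.toList
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("read-schema-sketch")
      .getOrCreate()
    try println(readAsLong(spark)) finally spark.stop()
  }
}
```

JSON is used here because the table lists it as covering case 4; the same read against Parquet is not claimed to work.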

How was this patch tested?

Pass Jenkins with the newly added test suites.

@dongjoon-hyun
Member Author

Hi, @gatorsmile , @cloud-fan , @HyukjinKwon , @viirya .
Could you review this PR?

@SparkQA

SparkQA commented Jan 9, 2018

Test build #85865 has finished for PR 20208 at commit 499801e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext
  • trait AddColumnEvolutionTest extends SchemaEvolutionTest
  • trait RemoveColumnEvolutionTest extends SchemaEvolutionTest
  • trait ChangePositionEvolutionTest extends SchemaEvolutionTest
  • trait BooleanTypeEvolutionTest extends SchemaEvolutionTest
  • trait IntegralTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest

@dongjoon-hyun
Member Author

Also, ping @sameeragarwal , too.

@dongjoon-hyun
Member Author

Also, ping @rxin , too.

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Jan 11, 2018

Test build #85983 has finished for PR 20208 at commit 499801e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext
  • trait AddColumnEvolutionTest extends SchemaEvolutionTest
  • trait RemoveColumnEvolutionTest extends SchemaEvolutionTest
  • trait ChangePositionEvolutionTest extends SchemaEvolutionTest
  • trait BooleanTypeEvolutionTest extends SchemaEvolutionTest
  • trait IntegralTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest

@dongjoon-hyun
Member Author

Retest this please.

@gatorsmile
Member

We are working on the Spark 2.3 release. Could you ping us after the release?

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jan 13, 2018

Thank you for the review, @gatorsmile.
Is there a concern that this could destabilize the 2.3 release? This is a test-only PR intended to build a clear consensus starting from Apache Spark 2.3.0. I think it is safe to include in 2.3.0.

@gatorsmile
Member

We do not have enough review bandwidth for test-only PRs before the Spark 2.3 release.

@SparkQA

SparkQA commented Jan 13, 2018

Test build #86099 has finished for PR 20208 at commit 499801e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext
  • trait AddColumnEvolutionTest extends SchemaEvolutionTest
  • trait RemoveColumnEvolutionTest extends SchemaEvolutionTest
  • trait ChangePositionEvolutionTest extends SchemaEvolutionTest
  • trait BooleanTypeEvolutionTest extends SchemaEvolutionTest
  • trait IntegralTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Jan 16, 2018

Test build #86144 has finished for PR 20208 at commit 499801e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext
  • trait AddColumnEvolutionTest extends SchemaEvolutionTest
  • trait RemoveColumnEvolutionTest extends SchemaEvolutionTest
  • trait ChangePositionEvolutionTest extends SchemaEvolutionTest
  • trait BooleanTypeEvolutionTest extends SchemaEvolutionTest
  • trait IntegralTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest

@SparkQA

SparkQA commented Jan 17, 2018

Test build #86259 has finished for PR 20208 at commit 22eb772.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext
  • trait AddColumnEvolutionTest extends SchemaEvolutionTest
  • trait RemoveColumnEvolutionTest extends SchemaEvolutionTest
  • trait ChangePositionEvolutionTest extends SchemaEvolutionTest
  • trait BooleanTypeEvolutionTest extends SchemaEvolutionTest
  • trait IntegralTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Jan 17, 2018

Test build #86281 has finished for PR 20208 at commit 22eb772.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext
  • trait AddColumnEvolutionTest extends SchemaEvolutionTest
  • trait RemoveColumnEvolutionTest extends SchemaEvolutionTest
  • trait ChangePositionEvolutionTest extends SchemaEvolutionTest
  • trait BooleanTypeEvolutionTest extends SchemaEvolutionTest
  • trait IntegralTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest

Member

nit: `byte` -> byte, or the opposite, for consistency with the other instances.

Member

Many tests here seem to contain duplicated code. Could we do something like the following?

Seq(byteDF, ...).zip(Seq("byte", ...)).foreach { case (df, t) =>
  test(s"boolean to $t") {
    // `df` is assumed to have been written to `path` beforehand.
    val actual = spark.read
      .schema("col1 long")
      .format(format)
      .options(options)
      .load(path)
    checkAnswer(actual, longDF)
  }
}

I am fine with any idea to deal with this duplication.

Member Author

@dongjoon-hyun dongjoon-hyun Jan 20, 2018

Hmm, when the variables (byteDF, ...) are defined outside of the test functions, it seems to cause SQLContext errors. Never mind, I handled that by making them lazy variables.
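
One plausible reading of the problem, offered here as an assumption rather than a confirmed diagnosis, is that suite-level fields are evaluated during construction, before the shared session is ready. The Spark-free sketch below models that with a hypothetical `Session` object; `EagerSuite`, `LazySuite`, and `createData` are invented names for illustration only.

```scala
// Hypothetical, Spark-free model: `Session` stands in for the shared SQLContext,
// which in this sketch only becomes usable once `start()` has been called.
object Session {
  private var started = false
  def start(): Unit = started = true
  def createData(name: String): String = {
    if (!started) throw new IllegalStateException("session is not initialized yet")
    s"data($name)"
  }
}

class EagerSuite {
  // Evaluated while the suite is being constructed, i.e. before `start()`.
  val byteDF: String = Session.createData("byte")
}

class LazySuite {
  // Evaluated on first access, which here happens only after `start()`.
  lazy val byteDF: String = Session.createData("byte")
}

object LazyValSketch {
  def main(args: Array[String]): Unit = {
    val lazySuite = new LazySuite // fine: nothing is evaluated yet
    val eagerFails = scala.util.Try(new EagerSuite).isFailure
    Session.start()
    println(s"eager construction failed: $eagerFails, lazy value: ${lazySuite.byteDF}")
  }
}
```

With `lazy val`, the data is built on first use inside a test body, which is after the harness has set up the session in this model.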

Member

I think we can do withTempPath.

Member

Shall we mention the number given above in this comment, like (case 1)?

@HyukjinKwon
Member

cc @sameeragarwal for reviewing too. I vaguely remember we had a talk about this before.

@dongjoon-hyun
Member Author

Thank you for the review, @HyukjinKwon. I'll update it accordingly.

@SparkQA

SparkQA commented Jan 20, 2018

Test build #86413 has finished for PR 20208 at commit e1d6f2a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 20, 2018

Test build #86414 has finished for PR 20208 at commit 29c281d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @gatorsmile , @cloud-fan , @sameeragarwal , @HyukjinKwon .
The PR is ready for review again. The Spark commit log seems to have been a little quiet since yesterday.
Could you spare some time to review this Schema Evolution suite? Thank you in advance for any advice!

Member

@dongjoon-hyun, how do we guarantee schema changes in Parquet and ORC?

I thought we (roughly) pick a file at random, read its footer, and use that schema, so I was thinking we don't properly support this. It makes sense for Parquet with mergeSchema, though.

I think it's not guaranteed in CSV either, because we rely on the header from one file.

Member Author

@dongjoon-hyun dongjoon-hyun Jan 22, 2018

Correct, and this is not about schema merging.
The final correct schema is given by users (or Hive).
In this PR, every schema is given by users, but for Hive tables, we use the Hive Metastore schema.

Member

Ah, the schema is explicitly set here. Sorry, I missed it.

@gatorsmile
Member

Will do it after 2.3 release

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Jan 24, 2018

Test build #86599 has finished for PR 20208 at commit 29c281d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86603 has finished for PR 20208 at commit 29c281d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Retest this please.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jan 26, 2018

Hi, @rxin , @cloud-fan , @sameeragarwal , @HyukjinKwon .

Could you share your opinions on this PR? I know that Xiao Li is busy during this period, so I didn't ping him this time. This PR is important to me. Sorry for bothering you.

@dongjoon-hyun
Member Author

I'll update the PR as follows.

  • Remove the `Remove a column` part from the descriptions (docs and test suite file doc) while keeping the test cases.
  • Add a clear description of the position rules for partition columns.
  • Mention upcast in the `Change a column type` part.

As for docs/sql-programming-guide.md, I'll keep it during the review period.

@SparkQA

SparkQA commented May 22, 2018

Test build #90919 has finished for PR 20208 at commit e136bc3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext
  • trait AddColumnEvolutionTest extends SchemaEvolutionTest
  • trait HideColumnAtTheEndEvolutionTest extends SchemaEvolutionTest
  • trait HideColumnInTheMiddleEvolutionTest extends SchemaEvolutionTest
  • trait ChangePositionEvolutionTest extends SchemaEvolutionTest
  • trait BooleanTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToStringTypeEvolutionTest extends SchemaEvolutionTest
  • trait IntegralTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest

@SparkQA

SparkQA commented May 22, 2018

Test build #90920 has finished for PR 20208 at commit ea9047a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext
  • trait AddColumnEvolutionTest extends SchemaEvolutionTest
  • trait HideColumnAtTheEndEvolutionTest extends SchemaEvolutionTest
  • trait HideColumnInTheMiddleEvolutionTest extends SchemaEvolutionTest
  • trait ChangePositionEvolutionTest extends SchemaEvolutionTest
  • trait BooleanTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToStringTypeEvolutionTest extends SchemaEvolutionTest
  • trait IntegralTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest

@dongjoon-hyun
Member Author

Sorry for the delay. I updated the PR according to the comments, @gatorsmile .
Could you review this once more?

@SparkQA

SparkQA commented Jun 11, 2018

Test build #91654 has finished for PR 20208 at commit a8026b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext
  • trait AddColumnEvolutionTest extends SchemaEvolutionTest
  • trait HideColumnAtTheEndEvolutionTest extends SchemaEvolutionTest
  • trait HideColumnInTheMiddleEvolutionTest extends SchemaEvolutionTest
  • trait ChangePositionEvolutionTest extends SchemaEvolutionTest
  • trait BooleanTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToStringTypeEvolutionTest extends SchemaEvolutionTest
  • trait IntegralTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest

@dongjoon-hyun
Member Author

Rebased to the master.

@SparkQA

SparkQA commented Jul 9, 2018

Test build #92731 has finished for PR 20208 at commit ebd239e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext
  • trait AddColumnEvolutionTest extends SchemaEvolutionTest
  • trait HideColumnAtTheEndEvolutionTest extends SchemaEvolutionTest
  • trait HideColumnInTheMiddleEvolutionTest extends SchemaEvolutionTest
  • trait ChangePositionEvolutionTest extends SchemaEvolutionTest
  • trait BooleanTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToStringTypeEvolutionTest extends SchemaEvolutionTest
  • trait IntegralTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Jul 9, 2018

Test build #92744 has finished for PR 20208 at commit ebd239e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext
  • trait AddColumnEvolutionTest extends SchemaEvolutionTest
  • trait HideColumnAtTheEndEvolutionTest extends SchemaEvolutionTest
  • trait HideColumnInTheMiddleEvolutionTest extends SchemaEvolutionTest
  • trait ChangePositionEvolutionTest extends SchemaEvolutionTest
  • trait BooleanTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToStringTypeEvolutionTest extends SchemaEvolutionTest
  • trait IntegralTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest
  • trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest

when `path/to/table/gender=male` is the path of the data and
users set `basePath` to `path/to/table/`, `gender` will be a partitioning column.

### Schema Evolution
Member

I still want to avoid using "schema evolution" in the docs or tests. "Schema projection" might be better. More importantly, you have to clarify that this only covers the read path.

What is the behavior in the write path when the physical and data schemas are different?

Member Author

Thank you for the review, @gatorsmile. I'll update it accordingly.

For write operations, we cannot specify a schema with `.schema` as we can on the read path. Spark either writes new files into the directory in addition to the existing ones or overwrites the directory.
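
A minimal sketch of that behavior, assuming `spark-sql` on the classpath: the writer takes its schema from the DataFrame, appending adds files next to the existing ones, and the user-specified read schema decides how all of them are interpreted. The object name `AppendThenReadSketch`, the helper `appendThenRead`, and the temporary path are hypothetical and not taken from the PR.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

// Hypothetical stand-alone sketch of the write path followed by a read with an explicit schema.
object AppendThenReadSketch {
  def appendThenRead(spark: SparkSession): Seq[(Int, Option[String])] = {
    import spark.implicits._

    // Scratch location for this example only; Spark creates the "data" directory on write.
    val path = Files.createTempDirectory("write-path").resolve("data").toString

    // The writer uses the DataFrame's own schema; there is no `.schema(...)` on the write side.
    Seq(1, 2).toDF("col1").write.format("json").save(path)

    // "append" adds new files to the directory; "overwrite" would replace its contents.
    Seq((3, "a")).toDF("col1", "col2").write.format("json").mode("append").save(path)

    // The read schema is given by the user and applied to both generations of files.
    val df = spark.read.schema("col1 int, col2 string").format("json").load(path)
    df.collect().map(r => (r.getInt(0), Option(r.getString(1)))).sortBy(_._1).toList
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("append-then-read-sketch")
      .getOrCreate()
    try println(appendThenRead(spark)) finally spark.stop()
  }
}
```

Rows from the older files come back with a null `col2`, which corresponds to case 1 (add a column) in the PR description.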

@dongjoon-hyun dongjoon-hyun changed the title from "[SPARK-23007][SQL][TEST] Add schema evolution test suite for file-based data sources" to "[SPARK-23007][SQL][TEST] Add read schema suite for file-based data sources" on Jul 10, 2018

/**
* All file-based data sources supports column addition and removal at the end.
*/
abstract class ReadSchemaSuite
Member Author

@gatorsmile . Now, it becomes ReadSchemaSuite.

import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}

/**
* The reader schema is said to be evolved (or projected) when it changed after the data is
Member Author

@dongjoon-hyun dongjoon-hyun Jul 10, 2018

Here, I clearly mentioned read schema and used evolved and projected as general verbs.

@dongjoon-hyun
Member Author

@gatorsmile , @HyukjinKwon .
Could you review this again for Spark 2.4?

* | CSV | 1, 2, 4 | |
* | JSON | 1, 2, 3, 4 | |
* | ORC | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage. |
* | PARQUET | 1, 2, 3 | |
Member

Thanks for helping improve the test coverage! All the included test cases are positive. How about negative test cases? What kinds of errors do you hit?

Member Author

Yes, right. Since the main purpose of this PR is to prevent regressions, it consists of positive test cases only. The errors differ case by case for each data source.

Taking BooleanTypeTest as an example, Parquet raises higher-level exceptions caused by a ClassCastException (shown at the bottom). JSON fails the test case with "Results do not match" without throwing exceptions.

  • Parquet
org.apache.spark.sql.execution.QueryExecutionException: Encounter error while reading parquet files. One possible cause: Parquet column cannot be converted in the corresponding files. Details: 
...
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file file:/private/var/folders/dc/1pz9m69x14q_gw8t7m143t1c0000gn/T/spark-4b3d788b-1d7e-4ca2-9c01-88f639daf02f/part-00000-975391e5-1f1d-49f5-8e12-3213281618ed-c000.snappy.parquet
...
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableByte cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableBoolean

@SparkQA

SparkQA commented Jul 11, 2018

Test build #92828 has finished for PR 20208 at commit a7064ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 11, 2018

Test build #92827 has finished for PR 20208 at commit 767d7ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait ReadSchemaTest extends QueryTest with SQLTestUtils with SharedSQLContext
  • trait AddColumnTest extends ReadSchemaTest
  • trait HideColumnAtTheEndTest extends ReadSchemaTest
  • trait HideColumnInTheMiddleTest extends ReadSchemaTest
  • trait ChangePositionTest extends ReadSchemaTest
  • trait BooleanTypeTest extends ReadSchemaTest
  • trait ToStringTypeTest extends ReadSchemaTest
  • trait IntegralTypeTest extends ReadSchemaTest
  • trait ToDoubleTypeTest extends ReadSchemaTest
  • trait ToDecimalTypeTest extends ReadSchemaTest

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jul 12, 2018

The test suites are composed as follows, according to the features each format supports.

class CSVReadSchemaSuite
  extends ReadSchemaSuite
  with IntegralTypeTest
  with ToDoubleTypeTest
  with ToDecimalTypeTest
  with ToStringTypeTest {

  override val format: String = "csv"
}

To add a negative test case, we would need to add something like `with NoBooleanTypeTest`. What do you think about that, @gatorsmile?
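
To make the mix-in idea concrete, here is a Spark-free sketch in which each trait registers the test names it contributes and a format-specific class picks the traits that match its capabilities. All names (`ReadSchemaModel`, `IntegralTypeModel`, `NoBooleanTypeModel`, `CsvReadSchemaModel`) and the test descriptions are hypothetical stand-ins, not the classes or tests added by this PR.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical, Spark-free model of the mix-in layout; not the classes added by the PR.
trait ReadSchemaModel {
  def format: String

  // Initialized before any mixed-in trait body runs, because this trait comes first
  // in the linearization order.
  private val registered = ListBuffer.empty[String]

  protected def register(name: String): Unit = registered += name

  def testNames: Seq[String] = registered.toList.map(n => s"$format: $n")
}

// Positive feature: mixed in only by formats that support the type change.
trait IntegralTypeModel extends ReadSchemaModel {
  register("integral type change is readable")
}

// Hypothetical negative counterpart: mixed in by formats expected to reject the change.
trait NoBooleanTypeModel extends ReadSchemaModel {
  register("boolean type change is rejected")
}

class CsvReadSchemaModel
  extends ReadSchemaModel
  with IntegralTypeModel
  with NoBooleanTypeModel {

  override def format: String = "csv"
}

object MixinSketch {
  def main(args: Array[String]): Unit =
    new CsvReadSchemaModel().testNames.foreach(println)
}
```

The trade-off in this layout is that every negative expectation needs its own trait, so the number of traits grows with each unsupported feature that should be pinned down.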

@dongjoon-hyun
Member Author

Retest this please.

@dongjoon-hyun
Member Author

@gatorsmile . Please let me know if I need to do more.

@SparkQA

SparkQA commented Jul 12, 2018

Test build #92939 has finished for PR 20208 at commit a7064ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

@dongjoon-hyun This PR is to improve the test coverage. LGTM.

When the schema does not match the schema of the underlying data source, the current error messages might be confusing. This is a common issue, I think. Could you submit a separate PR to improve the error handling in these cases?

@gatorsmile
Member

Thanks! Merged to master.

@asfgit asfgit closed this in 07704c9 Jul 12, 2018
@dongjoon-hyun
Member Author

Thank you so much, @gatorsmile . Sure. I'll make a PR to improve error handling for that.

@dongjoon-hyun
Member Author

Also, thank you, @HyukjinKwon .

@dongjoon-hyun dongjoon-hyun deleted the SPARK-SCHEMA-EVOLUTION branch July 13, 2018 16:11