Closed

Changes from 7 commits
4 changes: 2 additions & 2 deletions docs/sql-migration-guide-upgrade.md
@@ -109,8 +109,8 @@ displayTitle: Spark SQL Upgrading Guide

- In version 2.3 and earlier, Spark converts Parquet Hive tables by default but ignores table properties like `TBLPROPERTIES (parquet.compression 'NONE')`. This happens for ORC Hive table properties like `TBLPROPERTIES (orc.compress 'NONE')` in case of `spark.sql.hive.convertMetastoreOrc=true`, too. Since Spark 2.4, Spark respects Parquet/ORC specific table properties while converting Parquet/ORC Hive tables. As an example, `CREATE TABLE t(id int) STORED AS PARQUET TBLPROPERTIES (parquet.compression 'NONE')` would generate Snappy parquet files during insertion in Spark 2.3, and in Spark 2.4, the result would be uncompressed parquet files.

- Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. This means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, while in Spark 2.4 it would be converted into Spark's ORC data source table and ORC vectorization would be applied. Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.

- Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. This means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, while in Spark 2.4 it would be converted into Spark's ORC data source table and ORC vectorization would be applied. In addition, this makes Spark's Hive table read behavior more consistent across formats and with the behavior of `spark.read.load`. For example, for both ORC and Parquet Hive tables, `LOCATION '/table/*'` is required instead of `LOCATION '/table/'` to create an external table that reads direct sub-directories such as `'/table/dir'`. Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.
- In version 2.3 and earlier, CSV rows are considered malformed if at least one column value in the row is malformed. The CSV parser drops such rows in DROPMALFORMED mode or outputs an error in FAILFAST mode. Since Spark 2.4, a CSV row is considered malformed only when it contains malformed column values requested from the CSV datasource; other values can be ignored. As an example, suppose a CSV file contains the header "id,name" and one row "1234". In Spark 2.4, selecting the id column yields a row with the single column value 1234, but in Spark 2.3 and earlier the result is empty in DROPMALFORMED mode. To restore the previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to `false`.

- Since Spark 2.4, file listing for computing statistics is done in parallel by default. This can be disabled by setting `spark.sql.statistics.parallelFileListingInStatsComputation.enabled` to `false`.
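The table-property precedence described in the first bullet can be sketched outside Spark as a toy lookup. This is a hypothetical helper, not Spark's implementation; only the property name `parquet.compression` and the Snappy default come from the text above:

```python
def effective_compression(table_props, session_default="snappy"):
    # Toy model of the Spark 2.4 rule: a Parquet-specific table property,
    # when present, overrides the session-level default codec.
    if "parquet.compression" in table_props:
        return table_props["parquet.compression"]
    return session_default

print(effective_compression({}))                               # Spark 2.3-style fallback
print(effective_compression({"parquet.compression": "NONE"}))  # 2.4 honors the property
```

With no table property the session default wins; with `TBLPROPERTIES (parquet.compression 'NONE')` the table setting wins, matching the uncompressed output described above.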
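The `LOCATION '/table/*'` versus `LOCATION '/table/'` distinction above can be illustrated with plain filesystem globbing. This is a minimal sketch of the path semantics only, not Spark's actual file-listing code:

```python
import glob
import os
import tempfile

# Mimic a table whose data file lives one level below the table LOCATION:
# <root>/table/dir/part-0000
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "table", "dir"))
open(os.path.join(root, "table", "dir", "part-0000"), "w").close()

# LOCATION '/table/': the directory itself holds no data files directly.
direct_files = [p for p in glob.glob(os.path.join(root, "table", "*"))
                if os.path.isfile(p)]

# LOCATION '/table/*': the wildcard expands to the direct sub-directories,
# such as '/table/dir', where the data files actually live.
expanded = glob.glob(os.path.join(root, "table", "*"))

print(len(direct_files))            # no files at the top level
print(expanded[0].endswith("dir"))  # the wildcard reaches the data directory
```

This mirrors why, after conversion to the data source path, a wildcard location is needed to read data sitting in direct sub-directories.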
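The CSV column-pruning change can be sketched with a toy selector. The `select` helper below is hypothetical and only models the rule from the bullet above (a row is malformed only with respect to the requested columns), using the "id,name" / "1234" example from the text:

```python
import csv
import io

raw = "id,name\n1234\n"
header, *rows = list(csv.reader(io.StringIO(raw)))

def select(columns, row):
    # Spark 2.4-style: the row counts as malformed only if a *requested*
    # column is missing; unrequested columns are ignored.
    out = []
    for c in columns:
        idx = header.index(c)
        if idx >= len(row):
            return None  # malformed w.r.t. the requested columns
        out.append(row[idx])
    return out

print(select(["id"], rows[0]))          # the short row survives pruning
print(select(["id", "name"], rows[0]))  # malformed once name is requested
```

Selecting only `id` yields the value 1234, while requesting both columns marks the row malformed, matching the Spark 2.4 versus 2.3 behavior described above.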
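The idea behind parallel file listing for statistics can be sketched with a thread pool that stats several directories concurrently. This is an illustration of the concept, not Spark's listing code:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Build a few partition-like directories, each holding one data file.
root = tempfile.mkdtemp()
dirs = []
for i in range(4):
    d = os.path.join(root, f"part={i}")
    os.makedirs(d)
    open(os.path.join(d, "data.txt"), "w").close()
    dirs.append(d)

# List all directories in parallel instead of one after another.
with ThreadPoolExecutor(max_workers=4) as pool:
    listings = list(pool.map(os.listdir, dirs))

print(sum(len(files) for files in listings))
```

Setting the configuration to `false` corresponds to replacing the pool with a plain sequential loop over the same directories.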
@@ -186,6 +186,82 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
}
}

protected def testORCTableLocation(isConvertMetastore: Boolean): Unit = {
Member: Since this test helper function is only used in HiveOrcSourceSuite, can we move this into HiveOrcSourceSuite?

Contributor Author: ok, I moved.

  withTempDir { dir =>
    val someDF1 = Seq((1, 1, "orc1"), (2, 2, "orc2")).toDF("c1", "c2", "c3").repartition(1)
    withTable("tbl1", "tbl2", "tbl3", "tbl4") {
      val dataDir = s"${dir.getCanonicalPath}/l3/l2/l1/"
      val parentDir = s"${dir.getCanonicalPath}/l3/l2/"
      val l3Dir = s"${dir.getCanonicalPath}/l3/"
      val wildcardParentDir = new File(s"${dir}/l3/l2/*").toURI
      val wildcardL3Dir = new File(s"${dir}/l3/*").toURI
      someDF1.write.orc(dataDir)

      // LOCATION is the direct parent of the data directory (no wildcard).
      val parentDirStatement =
        s"""
           |CREATE EXTERNAL TABLE tbl1(
           | c1 int,
           | c2 int,
           | c3 string)
           |STORED AS orc
           |LOCATION '${parentDir}'""".stripMargin
      sql(parentDirStatement)
      val parentDirSqlStatement = "select * from tbl1"
      if (isConvertMetastore) {
        checkAnswer(sql(parentDirSqlStatement), Nil)
      } else {
        checkAnswer(sql(parentDirSqlStatement),
          (1 to 2).map(i => Row(i, i, s"orc$i")))
      }

      // LOCATION is two levels above the data directory (no wildcard).
      val l3DirStatement =
        s"""
           |CREATE EXTERNAL TABLE tbl2(
           | c1 int,
           | c2 int,
           | c3 string)
           |STORED AS orc
           |LOCATION '${l3Dir}'""".stripMargin
      sql(l3DirStatement)
      val l3DirSqlStatement = "select * from tbl2"
      if (isConvertMetastore) {
        checkAnswer(sql(l3DirSqlStatement), Nil)
      } else {
        checkAnswer(sql(l3DirSqlStatement),
          (1 to 2).map(i => Row(i, i, s"orc$i")))
      }

      // LOCATION uses a wildcard that expands to the data directory.
      val wildcardStatement =
        s"""
           |CREATE EXTERNAL TABLE tbl3(
           | c1 int,
           | c2 int,
           | c3 string)
           |STORED AS orc
           |LOCATION '$wildcardParentDir'""".stripMargin
      sql(wildcardStatement)
      val wildcardSqlStatement = "select * from tbl3"
      if (isConvertMetastore) {
        checkAnswer(sql(wildcardSqlStatement),
          (1 to 2).map(i => Row(i, i, s"orc$i")))
      } else {
        checkAnswer(sql(wildcardSqlStatement), Nil)
      }

      // LOCATION uses a wildcard one level too high; no data is found either way.
      val wildcardL3Statement =
        s"""
           |CREATE EXTERNAL TABLE tbl4(
           | c1 int,
           | c2 int,
           | c3 string)
           |STORED AS orc
           |LOCATION '$wildcardL3Dir'""".stripMargin
      sql(wildcardL3Statement)
      val wildcardL3SqlStatement = "select * from tbl4"
      checkAnswer(sql(wildcardL3SqlStatement), Nil)
    }
  }
}

test("create temporary orc table") {
checkAnswer(sql("SELECT COUNT(*) FROM normal_orc_source"), Row(10))

@@ -2370,4 +2370,51 @@ class HiveDDLSuite
))
}
}

test("SPARK-25993 Add test cases for resolution of Parquet table location") {
Member (@dongjoon-hyun, Dec 5, 2018): Maybe, move to HiveParquetSourceSuite? That's the suite similar to HiveOrcSourceSuite.

Member: Also, let's replace the test case name with CREATE EXTERNAL TABLE with subdirectories.

Member: Also, for full test coverage, can we have the following combination like ORC, too?

Seq(true, false).foreach { convertMetastore =>

Contributor Author: sure, changed.

  withTempPath { path =>
    val someDF1 = Seq((1, 1, "parq1"), (2, 2, "parq2")).toDF("c1", "c2", "c3").repartition(1)

Member: Indentation.

Contributor Author: fixed.

    withTable("tbl1", "tbl2", "tbl3") {
      val dataDir = s"${path.getCanonicalPath}/l3/l2/l1/"

Member: indentation?

Contributor Author: fixed

      val parentDir = s"${path.getCanonicalPath}/l3/l2/"
      val l3Dir = s"${path.getCanonicalPath}/l3/"
      val wildcardParentDir = new File(s"${path}/l3/l2/*").toURI
      val wildcardL3Dir = new File(s"${path}/l3/*").toURI
      someDF1.write.parquet(dataDir)
      val parentDirStatement =
        s"""
           |CREATE EXTERNAL TABLE tbl1(
           | c1 int,
           | c2 int,
           | c3 string)
           |STORED AS parquet
           |LOCATION '${parentDir}'""".stripMargin
      sql(parentDirStatement)
      checkAnswer(sql("select * from tbl1"), Nil)

      val wildcardStatement =
        s"""
           |CREATE EXTERNAL TABLE tbl2(
           | c1 int,
           | c2 int,
           | c3 string)
           |STORED AS parquet
           |LOCATION '${wildcardParentDir}'""".stripMargin
      sql(wildcardStatement)
      checkAnswer(sql("select * from tbl2"),
        (1 to 2).map(i => Row(i, i, s"parq$i")))

Member: indentation?

Contributor Author: fixed

      val wildcardL3Statement =
        s"""
           |CREATE EXTERNAL TABLE tbl3(
           | c1 int,
           | c2 int,
           | c3 string)
           |STORED AS parquet
           |LOCATION '${wildcardL3Dir}'""".stripMargin
      sql(wildcardL3Statement)
      checkAnswer(sql("select * from tbl3"), Nil)
    }
  }
}
}
@@ -190,4 +190,12 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton {
}
}
}

test("SPARK-25993 Add test cases for resolution of ORC table location") {
Member: Please change this to CREATE EXTERNAL TABLE with subdirectories, too.

Contributor Author: Changed.

  Seq(true, false).foreach { convertMetastore =>
    withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> s"$convertMetastore") {
      testORCTableLocation(convertMetastore)
    }
  }
}
}