2 changes: 2 additions & 0 deletions docs/sql-migration-guide-upgrade.md
@@ -111,6 +111,8 @@ displayTitle: Spark SQL Upgrading Guide

- Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. This means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3; in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.

- In version 2.3 and earlier, `spark.sql.hive.convertMetastoreOrc` defaults to `false`, so Spark uses the Hive ORC reader. If you specify a directory in the `LOCATION` clause of a `CREATE EXTERNAL TABLE ... STORED AS ORC LOCATION` statement, Spark reads the data into the table when that directory or any of its sub-directories contains matching data. If you specify a wildcard (`*`) instead, the Hive ORC reader cannot read the data because it treats the wildcard as a directory name. For example, with ORC data stored at `/tmp/orctab1/dir1/`, `create external table tab1(...) stored as orc location '/tmp/orctab1/'` will read the data into the table, but `create external table tab2(...) stored as orc location '/tmp/orctab1/*'` will not. Since Spark 2.4, `spark.sql.hive.convertMetastoreOrc` defaults to `true` and Spark uses the native ORC reader, which reads the data when you specify the wildcard but not when you specify only the parent directory. Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.
Member
Could you change `but will not if you specify the parent directory` to be clearer, with examples like the other sentence?

Contributor Author
sure.

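A minimal sketch of the behavior difference described in the new migration note, assuming a Hive-enabled Spark session; the paths, table names, and schema are illustrative only:

```scala
// ORC data sits in a sub-directory of the intended table location.
spark.range(2).write.orc("/tmp/orctab1/dir1")

// Spark 2.3 behavior (Hive ORC reader, convertMetastoreOrc=false):
// the parent directory resolves, the wildcard does not.
spark.sql("SET spark.sql.hive.convertMetastoreOrc=false")
spark.sql("CREATE EXTERNAL TABLE tab1(id BIGINT) STORED AS ORC LOCATION '/tmp/orctab1/'")
spark.sql("SELECT * FROM tab1").show()   // returns the two rows

// Spark 2.4 default (native ORC reader, convertMetastoreOrc=true):
// the wildcard resolves, the parent directory does not.
spark.sql("SET spark.sql.hive.convertMetastoreOrc=true")
spark.sql("CREATE EXTERNAL TABLE tab2(id BIGINT) STORED AS ORC LOCATION '/tmp/orctab1/*'")
spark.sql("SELECT * FROM tab2").show()   // returns the two rows
```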

- In version 2.3 and earlier, CSV rows are considered malformed if at least one column value in the row is malformed. The CSV parser drops such rows in the DROPMALFORMED mode or outputs an error in the FAILFAST mode. Since Spark 2.4, a CSV row is considered malformed only when it contains malformed column values requested from the CSV datasource; other values can be ignored. As an example, a CSV file contains the "id,name" header and one row "1234". In Spark 2.4, selecting the id column returns a row with the single column value 1234, but in Spark 2.3 and earlier the result is empty in the DROPMALFORMED mode. To restore the previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to `false`.

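A hedged illustration of the column-pruning change above; the file path is hypothetical:

```scala
// Hypothetical CSV file /tmp/people.csv:
//   id,name
//   1234

// Spark 2.4 (column pruning enabled by default): only `id` is requested,
// so the short row is not treated as malformed and 1234 is returned.
spark.read.option("header", "true").option("mode", "DROPMALFORMED")
  .csv("/tmp/people.csv").select("id").show()

// Restore the 2.3 behavior: the whole row counts as malformed and is dropped.
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")
spark.read.option("header", "true").option("mode", "DROPMALFORMED")
  .csv("/tmp/people.csv").select("id").show()
```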
- Since Spark 2.4, file listing for computing statistics is done in parallel by default. This can be disabled by setting `spark.sql.statistics.parallelFileListingInStatsComputation.enabled` to `false`.
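A short sketch of opting out before collecting statistics, assuming a table named `tab1` already exists:

```scala
// Fall back to sequential file listing when computing statistics (Spark 2.4+).
spark.conf.set("spark.sql.statistics.parallelFileListingInStatsComputation.enabled", "false")
spark.sql("ANALYZE TABLE tab1 COMPUTE STATISTICS")
```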
@@ -597,6 +597,38 @@ abstract class OrcQueryTest extends OrcTest {
      assert(m4.contains("Malformed ORC file"))
    }
  }

test("SPARK-25993 Add test cases for resolution of ORC table location") {
Member
`HiveOrcSourceSuite.scala` would be a better place. Also, we had better have the following to cover the behavior in both cases, `true` and `false`:

    Seq(true, false).foreach { convertMetastore =>
      withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> s"$convertMetastore") {

Contributor Author
OK, I will move the test case there. Thanks.

    withTempDir { dir =>
      val someDF1 = Seq((1, 1, "orc1"), (2, 2, "orc2")).toDF("c1", "c2", "c3").repartition(1)
      val tableName1 = "orcTable1"
      val tableName2 = "orcTable2"
      withTable(tableName1, tableName2) {
        // The ORC files are written into a sub-directory of the table location.
        val path1 = s"${dir.getCanonicalPath}/dir1/"
        someDF1.write.orc(path1)

        // Location is the parent directory: the native ORC reader does not pick up
        // the data in the sub-directory, so the table is empty.
        val path2 = s"${dir.getCanonicalPath}/"
        val sqlStatement1 =
          s"""
             |CREATE EXTERNAL TABLE $tableName1(C1 INT, C2 INT, C3 STRING)
             |STORED AS ORC LOCATION '${path2}'
           """.stripMargin
        sql(sqlStatement1)
        checkAnswer(
          sql(s"SELECT * FROM ${tableName1}"), Nil)

        // Location is a wildcard: the native ORC reader resolves it and reads the data.
        val path3 = s"${dir.getCanonicalPath}/*"
        val sqlStatement2 =
          s"""
             |CREATE EXTERNAL TABLE $tableName2(C1 INT, C2 INT, C3 STRING)
             |STORED AS ORC LOCATION '${path3}'
           """.stripMargin
        sql(sqlStatement2)
        checkAnswer(
          sql(s"SELECT * FROM ${tableName2}"),
          (1 to 2).map(i => Row(i, i, s"orc$i")))
      }
    }
  }
}

class OrcQuerySuite extends OrcQueryTest with SharedSQLContext {
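Following up on the review suggestion above, a rough sketch (not the PR's final code) of how the test could be wrapped to cover both `convertMetastoreOrc` settings once it is moved to `HiveOrcSourceSuite.scala`; the expectations for the `false` case follow the doc change above and are an assumption, not verified here:

```scala
  test("SPARK-25993 resolution of ORC table location under both ORC readers") {
    Seq(true, false).foreach { convertMetastore =>
      withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> s"$convertMetastore") {
        withTempDir { dir =>
          withTable("tab_dir", "tab_glob") {
            // Write the ORC files into a sub-directory of the table location.
            Seq((1, "orc1"), (2, "orc2")).toDF("c1", "c2").repartition(1)
              .write.orc(s"${dir.getCanonicalPath}/dir1")

            // Table whose location is the parent directory.
            sql(
              s"""
                 |CREATE EXTERNAL TABLE tab_dir(C1 INT, C2 STRING)
                 |STORED AS ORC LOCATION '${dir.getCanonicalPath}/'
               """.stripMargin)

            // Table whose location is a wildcard over the sub-directories.
            sql(
              s"""
                 |CREATE EXTERNAL TABLE tab_glob(C1 INT, C2 STRING)
                 |STORED AS ORC LOCATION '${dir.getCanonicalPath}/*'
               """.stripMargin)

            val rows = (1 to 2).map(i => Row(i, s"orc$i"))
            if (convertMetastore) {
              // Native ORC reader: the wildcard resolves, the parent directory does not.
              checkAnswer(sql("SELECT * FROM tab_dir"), Nil)
              checkAnswer(sql("SELECT * FROM tab_glob"), rows)
            } else {
              // Hive ORC reader (assumed, per the doc change): the parent directory
              // resolves, the wildcard is treated as a literal directory name.
              checkAnswer(sql("SELECT * FROM tab_dir"), rows)
              checkAnswer(sql("SELECT * FROM tab_glob"), Nil)
            }
          }
        }
      }
    }
  }
```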