doc the change in migration-guide
kevinyu98 committed Nov 21, 2018
commit e238764f278883b05d4bc88243facf897d357e84
2 changes: 2 additions & 0 deletions docs/sql-migration-guide-upgrade.md
@@ -111,6 +111,8 @@ displayTitle: Spark SQL Upgrading Guide

- Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. It means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. To restore the previous behavior, set `spark.sql.hive.convertMetastoreOrc` to `false`.
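
A minimal Scala sketch of switching back to the Hive SerDe path; the SparkSession setup is illustrative, not part of the patch:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-conversion-demo")
  .enableHiveSupport()
  .getOrCreate()

// Spark 2.4 default: this table is read through Spark's native, vectorized ORC source.
spark.sql("CREATE TABLE t(id int) STORED AS ORC")

// Restore the Spark 2.3 behavior and go through Hive SerDe instead.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
```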

- In version 2.3 and earlier, `spark.sql.hive.convertMetastoreOrc` defaults to `false`, so Spark uses the Hive ORC reader for `CREATE EXTERNAL TABLE ... STORED AS ORC LOCATION ...` statements. The Hive ORC reader reads the data into the table if the directory specified in the `LOCATION` clause, or one of its sub-directories, contains matching data; however, it cannot read the data if the path contains a wildcard (`*`), because it treats the wildcard as a directory name. For example, with ORC data stored at `/tmp/orctab1/dir1/`, `create external table tab1(...) stored as orc location '/tmp/orctab1/'` reads the data into the table, whereas `create external table tab2(...) stored as orc location '/tmp/orctab1/*'` does not. Since Spark 2.4, `spark.sql.hive.convertMetastoreOrc` defaults to `true` and Spark uses its native ORC reader, which behaves the opposite way: it reads the data when the wildcard is specified but not when only the parent directory is specified. With the same data layout, `create external table tab1(...) stored as orc location '/tmp/orctab1/*'` reads the data, whereas `create external table tab2(...) stored as orc location '/tmp/orctab1/'` does not. To restore the previous behavior, set `spark.sql.hive.convertMetastoreOrc` to `false`.
Member

Could you state `but will not if you specify the parent directory` more clearly, with examples like the other sentence?

Contributor Author

sure.
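
An end-to-end sketch of the flipped behavior under the Spark 2.4 native reader; the paths and table names are illustrative:

```scala
// Write ORC data into a sub-directory of the table location.
spark.range(1, 3).write.orc("/tmp/orctab1/dir1")

// With the native ORC reader (Spark 2.4 default), the wildcard finds the files...
spark.sql("CREATE EXTERNAL TABLE tab1(id bigint) STORED AS ORC LOCATION '/tmp/orctab1/*'")
spark.sql("SELECT * FROM tab1").show()  // rows are returned

// ...but the parent directory alone does not.
spark.sql("CREATE EXTERNAL TABLE tab2(id bigint) STORED AS ORC LOCATION '/tmp/orctab1/'")
spark.sql("SELECT * FROM tab2").show()  // empty result
```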


- In version 2.3 and earlier, CSV rows are considered malformed if at least one column value in the row is malformed. The CSV parser drops such rows in the `DROPMALFORMED` mode or outputs an error in the `FAILFAST` mode. Since Spark 2.4, a CSV row is considered malformed only when it contains malformed column values that are actually requested from the CSV datasource; other values can be ignored. As an example, consider a CSV file with the "id,name" header and one row "1234". In Spark 2.4, selecting the id column yields a row with the single value 1234, but in Spark 2.3 and earlier the result is empty in the `DROPMALFORMED` mode. To restore the previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to `false`.
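
A short Scala sketch of the "id,name" example above; the file path is illustrative:

```scala
// /tmp/people.csv contains:
//   id,name
//   1234
val df = spark.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .schema("id INT, name STRING")
  .csv("/tmp/people.csv")

df.select("id").show()
// Spark 2.4: one row with the value 1234, because `name` is never requested.
// Spark 2.3 and earlier: empty, because the whole row counts as malformed.

// Restore the previous behavior:
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")
```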

- Since Spark 2.4, file listing for computing statistics is done in parallel by default. This can be disabled by setting `spark.sql.statistics.parallelFileListingInStatsComputation.enabled` to `false`.
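
The flag is a regular SQL conf; a quick sketch, with `tab1` standing in for any table whose statistics are computed:

```scala
// Fall back to serial file listing during statistics computation.
spark.conf.set("spark.sql.statistics.parallelFileListingInStatsComputation.enabled", "false")

// File listing happens when statistics are (re)computed, e.g.:
spark.sql("ANALYZE TABLE tab1 COMPUTE STATISTICS")
```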
@@ -626,7 +626,6 @@ abstract class OrcQueryTest extends OrcTest {
checkAnswer(
sql(s"SELECT * FROM ${tableName2}"),
(1 to 2).map(i => Row(i, i, s"orc$i")))

}
}
}