SPARK-22833 [Improvement] in SparkHive Scala example - change in comments made
chetkhatri committed Dec 21, 2017
commit 69a4145140b4b8980f3893b677985b49f2da12a7
@@ -102,58 +102,32 @@ object SparkHiveExample {
// | 5| val_5| 5| val_5|
// ...

/*
* Save DataFrame to a Hive managed table in Parquet format.
* 1. Create the Hive database/schema with an explicit HDFS location if you want to specify one; otherwise the
* default warehouse location will be used to store the Hive table data.
* Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
* You don't have to give a location explicitly for each table; every table under the specified schema will be
* stored at the location given when the schema was created.
* 2. Create a Hive managed table with 'Parquet' as the storage format.
* Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
*/
val hiveTableDF = sql("SELECT * FROM records").toDF()
// Save DataFrame to Hive Managed table as Parquet format
Member:
Managed -> managed

Contributor Author:
@HyukjinKwon Thanks for highlighting; fixed.

val hiveTableDF = sql("SELECT * FROM records")
hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
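// Hypothetical sketch, not part of this commit: in a real run the database referenced above would be
// created before the saveAsTable call, mirroring the CREATE DATABASE example in the comment block;
// 'database_name' and the HDFS location are illustrative placeholders.
sql("CREATE DATABASE IF NOT EXISTS database_name LOCATION '/user/hive/warehouse/database_name.db'")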

/*
* Save DataFrame to a Hive external table in Spark-compatible Parquet format.
* 1. Create the Hive external table with Parquet as the storage format.
* Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
* Since we are not explicitly providing the Hive database location, it automatically uses the default warehouse
* location given by 'spark.sql.warehouse.dir' when the SparkSession is created with enableHiveSupport().
* For example, if '/user/hive/warehouse/' is the Hive warehouse location, schema directories are created under it,
* such as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
*/

// to make Hive parquet format compatible with spark parquet format
Member:
parquet -> Parquet

Member:
spark -> Spark

Contributor Author:
@HyukjinKwon Thanks for highlighting; fixed.

spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")

// Multiple parquet files could be created, according to the volume of data, under the given directory.
Member:
parquet -> Parquet.

Contributor Author:
@HyukjinKwon Thanks for highlighting; fixed.

val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"

// Save DataFrame to Hive External table as compatible parquet format
Member:
parquet -> Parquet.

Contributor Author:
@HyukjinKwon Thanks for highlighting; fixed.

hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
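// Hypothetical sketch, not part of this commit: once the Parquet files are written, a matching external
// table can be declared over that location, following the CREATE EXTERNAL TABLE example above; the
// 'database_name.records_ext' name is an illustrative placeholder.
sql("CREATE EXTERNAL TABLE IF NOT EXISTS database_name.records_ext(key INT, value STRING) " +
  s"STORED AS PARQUET LOCATION '$hiveExternalTableLocation'")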

// turn on flag for Dynamic Partitioning
Member:
turn -> Turn.

Contributor Author:
@HyukjinKwon Thanks for highlighting; fixed.

spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

// You can create partitions in the Hive table, so downstream queries run much faster.
hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
.parquet(hiveExternalTableLocation)
/*
If the data volume is very large, every partition may end up with many small files, which can hurt
downstream query performance due to file I/O, bandwidth I/O, network I/O, and disk I/O.
To improve performance you can create a single Parquet file under each partition directory by using 'repartition'
on the partition key of the Hive table. When you add partitions to the table, the table DDL changes.
Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET;
*/

// reduce number of files for each partition by repartition
Member:
reduce -> Reduce.

Contributor Author:
@HyukjinKwon Thanks for highlighting; fixed.

hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
Contributor:
This is not a standard usage; let's not put it in the example.

Contributor Author:
@cloud-fan Removed all comments; as discussed with @srowen, it really makes sense to have this in the docs, with the inconsistency removed.

.partitionBy("key").parquet(hiveExternalTableLocation)

/*
You can also use coalesce to control the number of files under each partition. repartition does a full shuffle
and distributes the data equally across all partitions, whereas coalesce can reduce the number of files to the
given 'Int' argument without a full data shuffle.
*/
// coalesce(10) could create up to 10 parquet files under each partition,
// if the data is large and partitioning makes sense.
// Control number of files in each partition by coalesce
Member:
Control number of files -> Control the number of files

Contributor Author:
@HyukjinKwon Thanks for highlighting; fixed.

hiveTableDF.coalesce(10).write.mode(SaveMode.Overwrite)
Contributor:
ditto

.partitionBy("key").parquet(hiveExternalTableLocation)
// $example off:spark_hive$