
Commit 965b3bb

[SPARK-9148] [SPARK-10252] [SQL] Update SQL Programming Guide

Author: Michael Armbrust <[email protected]>
Closes #8441 from marmbrus/documentation.
(cherry picked from commit dc86a22)
Signed-off-by: Michael Armbrust <[email protected]>

1 parent 30f0f7e commit 965b3bb

File tree

1 file changed, +73 -19 lines changed


docs/sql-programming-guide.md

Lines changed: 73 additions & 19 deletions
@@ -11,7 +11,7 @@ title: Spark SQL and DataFrames

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.

-For how to enable Hive support, please refer to the [Hive Tables](#hive-tables) section.
+Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the [Hive Tables](#hive-tables) section.

# DataFrames

@@ -213,6 +213,11 @@ df.groupBy("age").count().show()
// 30 1
{% endhighlight %}

+For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/scala/index.html#org.apache.spark.sql.DataFrame).
+
+In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations, and more. The complete list is available in the [DataFrame Function Reference](api/scala/index.html#org.apache.spark.sql.functions$).
+
+
</div>

<div data-lang="java" markdown="1">
@@ -263,6 +268,10 @@ df.groupBy("age").count().show();
// 30 1
{% endhighlight %}

+For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/java/org/apache/spark/sql/DataFrame.html).
+
+In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations, and more. The complete list is available in the [DataFrame Function Reference](api/java/org/apache/spark/sql/functions.html).
+
</div>

<div data-lang="python" markdown="1">
@@ -320,6 +329,10 @@ df.groupBy("age").count().show()

{% endhighlight %}

+For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/python/pyspark.sql.html#pyspark.sql.DataFrame).
+
+In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations, and more. The complete list is available in the [DataFrame Function Reference](api/python/pyspark.sql.html#module-pyspark.sql.functions).
+
</div>

<div data-lang="r" markdown="1">
@@ -370,10 +383,13 @@ showDF(count(groupBy(df, "age")))

{% endhighlight %}

-</div>
+For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/R/index.html).
+
+In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations, and more. The complete list is available in the [DataFrame Function Reference](api/R/index.html).

</div>

+</div>

## Running SQL Queries Programmatically
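As a quick illustration of the operation and function references added above, here is a minimal PySpark sketch. It assumes a `SQLContext` named `sqlContext` and the `people.json` example file used earlier in the guide:

```python
from pyspark.sql import functions as F

df = sqlContext.read.json("examples/src/main/resources/people.json")

# Simple column references and expressions...
df.select(df["name"], df["age"] + 1).show()

# ...plus the built-in function library: string manipulation, math helpers, etc.
df.select(F.upper(df["name"]).alias("name"),
          F.round(df["age"] / 10) * 10).show()
```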

@@ -870,12 +886,11 @@ saveDF(select(df, "name", "age"), "namesAndAges.parquet", "parquet")

Save operations can optionally take a `SaveMode` that specifies how to handle existing data if
present. It is important to realize that these save modes do not utilize any locking and are not
-atomic. Thus, it is not safe to have multiple writers attempting to write to the same location.
-Additionally, when performing a `Overwrite`, the data will be deleted before writing out the
+atomic. Additionally, when performing an `Overwrite`, the data will be deleted before writing out the
new data.

<table class="table">
-<tr><th>Scala/Java</th><th>Python</th><th>Meaning</th></tr>
+<tr><th>Scala/Java</th><th>Any Language</th><th>Meaning</th></tr>
<tr>
<td><code>SaveMode.ErrorIfExists</code> (default)</td>
<td><code>"error"</code> (default)</td>
@@ -1671,12 +1686,12 @@ results <- collect(sql(sqlContext, "FROM src SELECT key, value"))
### Interacting with Different Versions of Hive Metastore

One of the most important pieces of Spark SQL's Hive support is interaction with Hive metastore,
-which enables Spark SQL to access metadata of Hive tables. Starting from Spark 1.4.0, a single binary build of Spark SQL can be used to query different versions of Hive metastores, using the configuration described below.
+which enables Spark SQL to access metadata of Hive tables. Starting from Spark 1.4.0, a single binary
+build of Spark SQL can be used to query different versions of Hive metastores, using the configuration described below.
+Note that independent of the version of Hive that is being used to talk to the metastore, internally Spark SQL
+will compile against Hive 1.2.1 and use those classes for internal execution (SerDes, UDFs, UDAFs, etc.).

-Internally, Spark SQL uses two Hive clients, one for executing native Hive commands like `SET`
-and `DESCRIBE`, the other dedicated for communicating with Hive metastore. The former uses Hive
-jars of version 0.13.1, which are bundled with Spark 1.4.0. The latter uses Hive jars of the
-version specified by users. An isolated classloader is used here to avoid dependency conflicts.
+The following options can be used to configure the version of Hive that is used to retrieve metadata:

<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
@@ -1685,7 +1700,7 @@ version specified by users. An isolated classloader is used here to avoid depend
<td><code>0.13.1</code></td>
<td>
Version of the Hive metastore. Available
-options are <code>0.12.0</code> and <code>0.13.1</code>. Support for more versions is coming in the future.
+options are <code>0.12.0</code> through <code>1.2.1</code>.
</td>
</tr>
<tr>
@@ -1696,12 +1711,16 @@ version specified by users. An isolated classloader is used here to avoid depend
property can be one of three options:
<ol>
<li><code>builtin</code></li>
-Use Hive 0.13.1, which is bundled with the Spark assembly jar when <code>-Phive</code> is
+Use Hive 1.2.1, which is bundled with the Spark assembly jar when <code>-Phive</code> is
enabled. When this option is chosen, <code>spark.sql.hive.metastore.version</code> must be
-either <code>0.13.1</code> or not defined.
+either <code>1.2.1</code> or not defined.
<li><code>maven</code></li>
-Use Hive jars of specified version downloaded from Maven repositories.
-<li>A classpath in the standard format for both Hive and Hadoop.</li>
+Use Hive jars of specified version downloaded from Maven repositories. This configuration
+is not generally recommended for production deployments.
+<li>A classpath in the standard format for the JVM. This classpath must include all of Hive
+and its dependencies, including the correct version of Hadoop. These jars only need to be
+present on the driver, but if you are running in YARN cluster mode then you must ensure
+they are packaged with your application.</li>
</ol>
</td>
</tr>
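A minimal sketch of how these two properties might be set from PySpark. They are ordinary Spark conf entries, so `--conf` on `spark-submit` or `spark-defaults.conf` work equally well; the values shown are illustrative, not a recommendation:

```python
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = (SparkConf()
        .set("spark.sql.hive.metastore.version", "0.13.1")  # version of the metastore being queried
        .set("spark.sql.hive.metastore.jars", "maven"))      # or "builtin", or an explicit classpath

sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)  # the metastore client is created with the configured version
```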
@@ -2017,6 +2036,28 @@ options.

# Migration Guide

+## Upgrading From Spark SQL 1.4 to 1.5
+
+- Optimized execution using manually managed memory (Tungsten) is now enabled by default, along with
+  code generation for expression evaluation. These features can both be disabled by setting
+  `spark.sql.tungsten.enabled` to `false`.
+- Parquet schema merging is no longer enabled by default. It can be re-enabled by setting
+  `spark.sql.parquet.mergeSchema` to `true`.
+- Resolution of strings to columns in Python now supports using dots (`.`) to qualify the column or
+  access nested values. For example `df['table.column.nestedField']`. However, this means that if
+  your column name contains any dots you must now escape them using backticks (e.g., ``table.`column.with.dots`.nested``).
+- In-memory columnar storage partition pruning is on by default. It can be disabled by setting
+  `spark.sql.inMemoryColumnarStorage.partitionPruning` to `false`.
+- Unlimited precision decimal columns are no longer supported; instead, Spark SQL enforces a maximum
+  precision of 38. When inferring schema from `BigDecimal` objects, a precision of (38, 18) is now
+  used. When no precision is specified in DDL, the default remains `Decimal(10, 0)`.
+- Timestamps are now stored at a precision of 1us, rather than 1ns.
+- In the `sql` dialect, floating point numbers are now parsed as decimal. HiveQL parsing remains
+  unchanged.
+- The canonical names of SQL/DataFrame functions are now lower case (e.g. `sum` vs `SUM`).
+- It has been determined that using the DirectOutputCommitter when speculation is enabled is unsafe,
+  and thus this output committer will not be used when speculation is on, independent of configuration.
+
## Upgrading from Spark SQL 1.3 to 1.4

#### DataFrame data reader/writer interface
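To make the new dotted-name resolution rule above concrete, a minimal PySpark sketch with hypothetical column names (assuming a `SQLContext` named `sqlContext`):

```python
from pyspark.sql import functions as F

# One ordinary column and one whose name literally contains dots.
df = sqlContext.createDataFrame([(1, 2)], ["a", "col.with.dots"])

df.select("`col.with.dots`").show()   # literal dots in a name must now be escaped with backticks

nested = df.select(F.struct("a").alias("s"))
nested.select("s.a").show()           # dots now reach into nested struct fields
```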
@@ -2038,7 +2079,8 @@ See the API docs for `SQLContext.read` (

#### DataFrame.groupBy retains grouping columns

-Based on user feedback, we changed the default behavior of `DataFrame.groupBy().agg()` to retain the grouping columns in the resulting `DataFrame`. To keep the behavior in 1.3, set `spark.sql.retainGroupColumns` to `false`.
+Based on user feedback, we changed the default behavior of `DataFrame.groupBy().agg()` to retain the
+grouping columns in the resulting `DataFrame`. To keep the behavior in 1.3, set `spark.sql.retainGroupColumns` to `false`.

<div class="codetabs">
<div data-lang="scala" markdown="1">
@@ -2175,7 +2217,7 @@ Python UDF registration is unchanged.
When using DataTypes in Python you will need to construct them (i.e. `StringType()`) instead of
referencing a singleton.

-## Migration Guide for Shark User
+## Migration Guide for Shark Users

### Scheduling
To set a [Fair Scheduler](job-scheduling.html#fair-scheduler-pools) pool for a JDBC client session,
@@ -2251,6 +2293,7 @@ Spark SQL supports the vast majority of Hive features, such as:
* User defined functions (UDF)
* User defined aggregation functions (UDAF)
* User defined serialization formats (SerDes)
+* Window functions
* Joins
  * `JOIN`
  * `{LEFT|RIGHT|FULL} OUTER JOIN`
@@ -2261,7 +2304,7 @@ Spark SQL supports the vast majority of Hive features, such as:
  * `SELECT col FROM ( SELECT a + b AS col from t1) t2`
* Sampling
* Explain
-* Partitioned tables
+* Partitioned tables including dynamic partition insertion
* View
* All Hive DDL Functions, including:
  * `CREATE TABLE`
@@ -2323,8 +2366,9 @@ releases of Spark SQL.
Hive can optionally merge the small files into fewer large files to avoid overflowing the HDFS
metadata. Spark SQL does not support that.

+# Reference

-# Data Types
+## Data Types

Spark SQL and DataFrames support the following data types:

@@ -2937,3 +2981,13 @@ from pyspark.sql.types import *

</div>

+## NaN Semantics
+
+There is special handling for not-a-number (NaN) when dealing with `float` or `double` types that
+does not exactly match standard floating point semantics.
+Specifically:
+
+- NaN = NaN returns true.
+- In aggregations, all NaN values are grouped together.
+- NaN is treated as a normal value in join keys.
+- NaN values go last when in ascending order, larger than any other numeric value.
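A minimal PySpark sketch (hypothetical data, assuming a `SQLContext` named `sqlContext`) showing the rules listed above:

```python
data = [(float("nan"),), (float("nan"),), (1.0,), (2.0,)]
df = sqlContext.createDataFrame(data, ["v"])

df.groupBy("v").count().show()              # both NaN rows fall into a single group
df.orderBy("v").show()                      # ascending order: 1.0, 2.0, then NaN last
df.filter(df["v"] == float("nan")).show()   # NaN = NaN evaluates to true, so both NaN rows match
```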
