docs/sql-programming-guide.md (73 additions, 19 deletions)
Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.

Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the [Hive Tables](#hive-tables) section.
For a complete list of the types of operations that can be performed on a DataFrame refer to the [API Documentation](api/scala/index.html#org.apache.spark.sql.DataFrame).

In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/scala/index.html#org.apache.spark.sql.functions$).
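As a quick illustration, here is a minimal Scala sketch (not part of the guide) that combines a few of these built-in functions. It assumes an existing `SQLContext` named `sqlContext` and a hypothetical registered table `people` with `name` and `age` columns:

```scala
import org.apache.spark.sql.functions._

// Hypothetical table with "name" (string) and "age" (integer) columns.
val people = sqlContext.table("people")

people.select(
  upper(col("name")).as("name_upper"),        // string manipulation
  (col("age") + lit(1)).as("age_next_year"),  // simple column arithmetic
  round(sqrt(col("age")), 2).as("sqrt_age")   // common math functions
).show()
```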
For a complete list of the types of operations that can be performed on a DataFrame refer to the [API Documentation](api/java/org/apache/spark/sql/DataFrame.html).

In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/java/org/apache/spark/sql/functions.html).
For a complete list of the types of operations that can be performed on a DataFrame refer to the [API Documentation](api/python/pyspark.sql.html#pyspark.sql.DataFrame).

In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/python/pyspark.sql.html#module-pyspark.sql.functions).
For a complete list of the types of operations that can be performed on a DataFrame refer to the [API Documentation](api/R/index.html).

In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/R/index.html).
### Interacting with Different Versions of Hive Metastore
One of the most important pieces of Spark SQL's Hive support is interaction with the Hive metastore,
which enables Spark SQL to access metadata of Hive tables. Starting from Spark 1.4.0, a single binary
build of Spark SQL can be used to query different versions of Hive metastores, using the configuration described below.
Note that independent of the version of Hive that is being used to talk to the metastore, internally Spark SQL
will compile against Hive 1.2.1 and use those classes for internal execution (serdes, UDFs, UDAFs, etc.).

The following options can be used to configure the version of Hive that is used to retrieve metadata:
  <td><code>0.13.1</code></td>
  <td>
    Version of the Hive metastore. Available
    options are <code>0.12.0</code> through <code>1.2.1</code>.
  </td>
</tr>
<tr>
    property can be one of three options:
    <ol>
      <li><code>builtin</code></li>
      Use Hive 1.2.1, which is bundled with the Spark assembly jar when <code>-Phive</code> is
      enabled. When this option is chosen, <code>spark.sql.hive.metastore.version</code> must be
      either <code>1.2.1</code> or not defined.
      <li><code>maven</code></li>
      Use Hive jars of the specified version downloaded from Maven repositories. This configuration
      is not generally recommended for production deployments.
      <li>A classpath in the standard format for the JVM. This classpath must include all of Hive
      and its dependencies, including the correct version of Hadoop. These jars only need to be
      present on the driver, but if you are running in yarn cluster mode then you must ensure
      they are packaged with your application.</li>
    </ol>
  </td>
</tr>
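As an illustration (not part of the guide), here is a minimal Scala sketch of using these two properties to talk to a Hive 0.13.1 metastore; the application name is hypothetical, and both settings must be in place before the `HiveContext` is created:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Fetch Hive 0.13.1 client jars from Maven and use them only for metastore access;
// internal execution still uses the built-in Hive 1.2.1 classes, as described above.
val conf = new SparkConf()
  .setAppName("MetastoreVersionExample")
  .set("spark.sql.hive.metastore.version", "0.13.1")
  .set("spark.sql.hive.metastore.jars", "maven")

val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)

hiveContext.sql("SHOW TABLES").show()
```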
# Migration Guide

## Upgrading From Spark SQL 1.4 to 1.5

 - Optimized execution using manually managed memory (Tungsten) is now enabled by default, along with
   code generation for expression evaluation. These features can both be disabled by setting
   `spark.sql.tungsten.enabled` to `false` (see the sketch after this list).
 - Parquet schema merging is no longer enabled by default. It can be re-enabled by setting
   `spark.sql.parquet.mergeSchema` to `true`.
 - Resolution of strings to columns in Python now supports using dots (`.`) to qualify the column or
   access nested values. For example `df['table.column.nestedField']`. However, this means that if
   your column name contains any dots you must now escape them using backticks (e.g., ``table.`column.with.dots`.nested``).
 - In-memory columnar storage partition pruning is on by default. It can be disabled by setting
   `spark.sql.inMemoryColumnarStorage.partitionPruning` to `false`.
 - Unlimited precision decimal columns are no longer supported; instead Spark SQL enforces a maximum
   precision of 38. When inferring schema from `BigDecimal` objects, a precision of (38, 18) is now
   used. When no precision is specified in DDL then the default remains `Decimal(10, 0)`.
 - Timestamps are now stored at a precision of 1us, rather than 1ns.
 - In the `sql` dialect, floating point numbers are now parsed as decimal. HiveQL parsing remains
   unchanged.
 - The canonical names of SQL/DataFrame functions are now lower case (e.g. `sum` vs `SUM`).
 - It has been determined that using the DirectOutputCommitter when speculation is enabled is unsafe,
   and thus this output committer will not be used when speculation is on, independent of configuration.
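As a hedged illustration (not part of the guide), the following sketch assumes an existing `SQLContext` named `sqlContext` and shows how the two defaults mentioned above could be reverted at runtime:

```scala
// Revert two of the 1.5 behavior changes described above.
sqlContext.setConf("spark.sql.tungsten.enabled", "false")    // disable Tungsten and expression codegen
sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")  // re-enable Parquet schema merging

// Column names that contain dots must now be escaped with backticks, e.g.:
// df.select("`column.with.dots`")
```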
## Upgrading from Spark SQL 1.3 to 1.4
#### DataFrame data reader/writer interface
#### DataFrame.groupBy retains grouping columns

Based on user feedback, we changed the default behavior of `DataFrame.groupBy().agg()` to retain the
grouping columns in the resulting `DataFrame`. To keep the behavior in 1.3, set `spark.sql.retainGroupColumns` to `false`.
<div class="codetabs">
<div data-lang="scala" markdown="1">
Python UDF registration is unchanged.

When using DataTypes in Python you will need to construct them (i.e. `StringType()`) instead of
referencing a singleton.

## Migration Guide for Shark Users

### Scheduling
To set a [Fair Scheduler](job-scheduling.html#fair-scheduler-pools) pool for a JDBC client session,
Spark SQL supports the vast majority of Hive features, such as:

* User defined functions (UDF)
* User defined aggregation functions (UDAF)
* User defined serialization formats (SerDes)
* Window functions (see the sketch after this list)
* Joins
  * `JOIN`
  * `{LEFT|RIGHT|FULL} OUTER JOIN`
  * `SELECT col FROM ( SELECT a + b AS col from t1) t2`
* Sampling
* Explain
* Partitioned tables including dynamic partition insertion
* View
* All Hive DDL Functions, including:
  * `CREATE TABLE`
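As an illustration of the newly listed window function support, here is a hedged sketch (not from the guide) of a HiveQL window query issued through a `HiveContext` named `hiveContext`; the `employees` table and its columns are hypothetical:

```scala
// Rank employees by salary within each department using a HiveQL window function.
hiveContext.sql("""
  SELECT name, department, salary,
         rank() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank
  FROM employees
""").show()
```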
Hive can optionally merge the small files into fewer large files to avoid overflowing the HDFS
metadata. Spark SQL does not support that.

# Reference

## Data Types

Spark SQL and DataFrames support the following data types:
</div>

## NaN Semantics

There is special handling for not-a-number (NaN) when dealing with `float` or `double` types that
does not exactly match standard floating point semantics. Specifically, as illustrated in the sketch
after this list:

 - NaN = NaN returns true.
 - In aggregations, all NaN values are grouped together.
 - NaN is treated as a normal value in join keys.
 - NaN values go last when in ascending order, larger than any other numeric value.
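A minimal Scala sketch of these semantics (assuming an existing `SQLContext` named `sqlContext` with its implicits imported; the column name is made up):

```scala
import sqlContext.implicits._

val df = Seq(1.0, Double.NaN, Double.NaN).toDF("value")

// NaN = NaN returns true, so this filter keeps both NaN rows.
df.filter($"value" === Double.NaN).show()

// In aggregations, all NaN values end up in a single group.
df.groupBy("value").count().show()

// NaN sorts last in ascending order, after every other numeric value.
df.orderBy($"value".asc).show()
```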