Merged
Changes from 1 commit (of 70)

Commits
7315880
[SPARK-19572][SPARKR] Allow to disable hive in sparkR shell
zjffdu Mar 1, 2017
89cd384
[SPARK-19460][SPARKR] Update dataset used in R documentation, example…
wangmiao1981 Mar 1, 2017
4913c92
[SPARK-19633][SS] FileSource read from FileSink
lw-lin Mar 1, 2017
38e7835
[SPARK-19736][SQL] refreshByPath should clear all cached plans with t…
viirya Mar 1, 2017
5502a9c
[SPARK-19766][SQL] Constant alias columns in INNER JOIN should not be…
stanzhai Mar 1, 2017
8aa560b
[SPARK-19761][SQL] create InMemoryFileIndex with an empty rootPaths w…
windpiger Mar 1, 2017
417140e
[SPARK-19787][ML] Changing the default parameter of regParam.
datumbox Mar 1, 2017
2ff1467
[DOC][MINOR][SPARKR] Update SparkR doc for names, columns and colnames
actuaryzhang Mar 1, 2017
db0ddce
[SPARK-19775][SQL] Remove an obsolete `partitionBy().insertInto()` te…
dongjoon-hyun Mar 1, 2017
51be633
[SPARK-19777] Scan runningTasksSet when check speculatable tasks in T…
Mar 2, 2017
89990a0
[SPARK-13931] Stage can hang if an executor fails while speculated ta…
Mar 2, 2017
de2b53d
[SPARK-19583][SQL] CTAS for data source table with a created location…
windpiger Mar 2, 2017
3bd8ddf
[MINOR][ML] Fix comments in LSH Examples and Python API
Mar 2, 2017
d2a8797
[SPARK-19734][PYTHON][ML] Correct OneHotEncoder doc string to say dro…
markgrover Mar 2, 2017
8d6ef89
[SPARK-18352][DOCS] wholeFile JSON update doc and programming guide
felixcheung Mar 2, 2017
625cfe0
[SPARK-19733][ML] Removed unnecessary castings and refactored checked…
datumbox Mar 2, 2017
50c08e8
[SPARK-19704][ML] AFTSurvivalRegression should support numeric censorCol
zhengruifeng Mar 2, 2017
9cca3db
[SPARK-19345][ML][DOC] Add doc for "coldStartStrategy" usage in ALS
Mar 2, 2017
5ae3516
[SPARK-19720][CORE] Redact sensitive information from SparkSubmit con…
markgrover Mar 2, 2017
433d9eb
[SPARK-19631][CORE] OutputCommitCoordinator should not allow commits …
Mar 2, 2017
8417a7a
[SPARK-19276][CORE] Fetch Failure handling robust to user error handling
squito Mar 3, 2017
93ae176
[SPARK-19745][ML] SVCAggregator captures coefficients in its closure
sethah Mar 3, 2017
f37bb14
[SPARK-19602][SQL][TESTS] Add tests for qualified column names
skambha Mar 3, 2017
e24f21b
[SPARK-19779][SS] Delete needless tmp file after restart structured s…
gf53520 Mar 3, 2017
982f322
[SPARK-18726][SQL] resolveRelation for FileFormat DataSource don't ne…
windpiger Mar 3, 2017
d556b31
[SPARK-18699][SQL][FOLLOWUP] Add explanation in CSV parser and minor …
HyukjinKwon Mar 3, 2017
fa50143
[SPARK-19739][CORE] propagate S3 session token to cluster
uncleGen Mar 3, 2017
0bac3e4
[SPARK-19797][DOC] ML pipeline document correction
ymwdalex Mar 3, 2017
776fac3
[SPARK-19801][BUILD] Remove JDK7 from Travis CI
dongjoon-hyun Mar 3, 2017
98bcc18
[SPARK-19758][SQL] Resolving timezone aware expressions with time zon…
viirya Mar 3, 2017
37a1c0e
[SPARK-19710][SQL][TESTS] Fix ordering of rows in query results
robbinspg Mar 3, 2017
9314c08
[SPARK-19774] StreamExecution should call stop() on sources when a st…
brkyvz Mar 3, 2017
ba186a8
[MINOR][DOC] Fix doc for web UI https configuration
jerryshao Mar 3, 2017
2a7921a
[SPARK-18939][SQL] Timezone support in partition values.
ueshin Mar 4, 2017
44281ca
[SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safe
BryanCutler Mar 4, 2017
f5fdbe0
[SPARK-13446][SQL] Support reading data from Hive 2.0.1 metastore
gatorsmile Mar 4, 2017
a6a7a95
[SPARK-19718][SS] Handle more interrupt cases properly for Hadoop
zsxwing Mar 4, 2017
9e5b4ce
[SPARK-19084][SQL] Ensure context class loader is set when initializi…
Mar 4, 2017
fbc4058
[SPARK-19816][SQL][TESTS] Fix an issue that DataFrameCallbackSuite do…
zsxwing Mar 4, 2017
6b0cfd9
[SPARK-19550][SPARKR][DOCS] Update R document to use JDK8
wangyum Mar 4, 2017
42c4cd9
[SPARK-19792][WEBUI] In the Master Page, the column named “Memory per …
10110346 Mar 5, 2017
f48461a
[SPARK-19805][TEST] Log the row type when query result does not match
uncleGen Mar 5, 2017
14bb398
[SPARK-19254][SQL] Support Seq, Map, and Struct in functions.lit
maropu Mar 5, 2017
80d5338
[SPARK-19795][SPARKR] add column functions to_json, from_json
felixcheung Mar 5, 2017
369a148
[SPARK-19595][SQL] Support json array in from_json
HyukjinKwon Mar 5, 2017
70f9d7f
[SPARK-19535][ML] RecommendForAllUsers RecommendForAllItems for ALS o…
sueann Mar 6, 2017
224e0e7
[SPARK-19701][SQL][PYTHON] Throws a correct exception for 'in' operat…
HyukjinKwon Mar 6, 2017
207067e
[SPARK-19822][TEST] CheckpointSuite.testCheckpointedOperation: should…
uncleGen Mar 6, 2017
2a0bc86
[SPARK-17495][SQL] Support Decimal type in Hive-hash
tejasapatil Mar 6, 2017
339b53a
[SPARK-19737][SQL] New analysis rule for reporting unregistered funct…
liancheng Mar 6, 2017
46a64d1
[SPARK-19304][STREAMING][KINESIS] fix kinesis slow checkpoint recovery
Gauravshah Mar 6, 2017
096df6d
[SPARK-19257][SQL] location for table/partition/database should be ja…
windpiger Mar 6, 2017
12bf832
[SPARK-19796][CORE] Fix serialization of long property values in Task…
squito Mar 6, 2017
9991c2d
[SPARK-19211][SQL] Explicitly prevent Insert into View or Create View…
jiangxb1987 Mar 6, 2017
9265436
[SPARK-19382][ML] Test sparse vectors in LinearSVCSuite
wangmiao1981 Mar 6, 2017
f6471dc
[SPARK-19709][SQL] Read empty file with CSV data source
wojtek-szymanski Mar 6, 2017
b0a5cd8
[SPARK-19719][SS] Kafka writer for both structured streaming and batc…
Mar 7, 2017
9909f6d
[SPARK-19350][SQL] Cardinality estimation of Limit and Sample
Mar 7, 2017
1f6c090
[SPARK-19818][SPARKR] rbind should check for name consistency of inpu…
actuaryzhang Mar 7, 2017
e52499e
[SPARK-19832][SQL] DynamicPartitionWriteTask get partitionPath should…
windpiger Mar 7, 2017
932196d
[SPARK-17075][SQL][FOLLOWUP] fix filter estimation issues
Mar 7, 2017
030acdd
[SPARK-19637][SQL] Add to_json in FunctionRegistry
maropu Mar 7, 2017
c05baab
[SPARK-19765][SPARK-18549][SQL] UNCACHE TABLE should un-cache all cac…
cloud-fan Mar 7, 2017
4a9034b
[SPARK-17498][ML] StringIndexer enhancement for handling unseen labels
Mar 7, 2017
d69aeea
[SPARK-19516][DOC] update public doc to use SparkSession instead of S…
cloud-fan Mar 7, 2017
49570ed
[SPARK-19803][TEST] flaky BlockManagerReplicationSuite test failure
uncleGen Mar 7, 2017
6f46846
[SPARK-19561] [PYTHON] cast TimestampType.toInternal output to long
Mar 7, 2017
2e30c0b
[SPARK-19702][MESOS] Increase default refuse_seconds timeout in the M…
Mar 7, 2017
8e41c2e
[SPARK-19857][YARN] Correctly calculate next credential update time.
Mar 8, 2017
47b2f68
Revert "[SPARK-19561] [PYTHON] cast TimestampType.toInternal output t…
cloud-fan Mar 8, 2017
[SPARK-18352][DOCS] wholeFile JSON update doc and programming guide
## What changes were proposed in this pull request?

Update the docs for R and the programming guide; clarify the default JSON-reading behavior for all languages.

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes apache#17128 from felixcheung/jsonwholefiledoc.
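
For context, here is a minimal sketch of the two read modes this doc change describes, assuming Spark 2.1+ with the `wholeFile` option as named in this patch (paths and file layout are hypothetical; later Spark releases renamed the option to `multiLine`):

```scala
import org.apache.spark.sql.SparkSession

object JsonReadModes {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("wholeFile JSON sketch")
      .master("local[*]")
      .getOrCreate()

    // Default: JSON Lines (newline-delimited JSON), one complete,
    // self-contained JSON object per line (http://jsonlines.org/).
    val jsonLines = spark.read.json("path/to/people.jsonl")

    // Opt-in: a regular multi-line JSON file, one record per file.
    val multiLine = spark.read
      .option("wholeFile", true)
      .json("path/to/single_record.json")

    jsonLines.printSchema()
    multiLine.printSchema()
    spark.stop()
  }
}
```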
felixcheung authored and Felix Cheung committed Mar 2, 2017
commit 8d6ef895ee492b8febbaac7ab2ef2c907b48fa4a
10 changes: 7 additions & 3 deletions R/pkg/R/SQLContext.R
@@ -332,8 +332,10 @@ setMethod("toDF", signature(x = "RDD"),
 
 #' Create a SparkDataFrame from a JSON file.
 #'
-#' Loads a JSON file (\href{http://jsonlines.org/}{JSON Lines text format or newline-delimited JSON}
-#' ), returning the result as a SparkDataFrame
+#' Loads a JSON file, returning the result as a SparkDataFrame
+#' By default, (\href{http://jsonlines.org/}{JSON Lines text format or newline-delimited JSON}
+#' ) is supported. For JSON (one record per file), set a named property \code{wholeFile} to
+#' \code{TRUE}.
 #' It goes through the entire dataset once to determine the schema.
 #'
 #' @param path Path of file to read. A vector of multiple paths is allowed.
@@ -346,6 +348,7 @@ setMethod("toDF", signature(x = "RDD"),
 #' sparkR.session()
 #' path <- "path/to/file.json"
 #' df <- read.json(path)
+#' df <- read.json(path, wholeFile = TRUE)
 #' df <- jsonFile(path)
 #' }
 #' @name read.json
@@ -778,14 +781,15 @@ dropTempView <- function(viewName) {
 #' @return SparkDataFrame
 #' @rdname read.df
 #' @name read.df
+#' @seealso \link{read.json}
 #' @export
 #' @examples
 #'\dontrun{
 #' sparkR.session()
 #' df1 <- read.df("path/to/file.json", source = "json")
 #' schema <- structType(structField("name", "string"),
 #'                      structField("info", "map<string,double>"))
-#' df2 <- read.df(mapTypeJsonPath, "json", schema)
+#' df2 <- read.df(mapTypeJsonPath, "json", schema, wholeFile = TRUE)
 #' df3 <- loadDF("data/test_table", "parquet", mergeSchema = "true")
 #' }
 #' @name read.df
26 changes: 15 additions & 11 deletions docs/sql-programming-guide.md
@@ -386,8 +386,8 @@ For example:
 
 The [built-in DataFrames functions](api/scala/index.html#org.apache.spark.sql.functions$) provide common
 aggregations such as `count()`, `countDistinct()`, `avg()`, `max()`, `min()`, etc.
-While those functions are designed for DataFrames, Spark SQL also has type-safe versions for some of them in 
-[Scala](api/scala/index.html#org.apache.spark.sql.expressions.scalalang.typed$) and 
+While those functions are designed for DataFrames, Spark SQL also has type-safe versions for some of them in
+[Scala](api/scala/index.html#org.apache.spark.sql.expressions.scalalang.typed$) and
 [Java](api/java/org/apache/spark/sql/expressions/javalang/typed.html) to work with strongly typed Datasets.
 Moreover, users are not limited to the predefined aggregate functions and can create their own.
 
@@ -397,7 +397,7 @@ Moreover, users are not limited to the predefined aggregate functions and can create their own.
 
 <div data-lang="scala" markdown="1">
 
-Users have to extend the [UserDefinedAggregateFunction](api/scala/index.html#org.apache.spark.sql.expressions.UserDefinedAggregateFunction) 
+Users have to extend the [UserDefinedAggregateFunction](api/scala/index.html#org.apache.spark.sql.expressions.UserDefinedAggregateFunction)
 abstract class to implement a custom untyped aggregate function. For example, a user-defined average
 can look like:
 
@@ -888,8 +888,9 @@ or a JSON file.
 
 Note that the file that is offered as _a json file_ is not a typical JSON file. Each
 line must contain a separate, self-contained valid JSON object. For more information, please see
-[JSON Lines text format, also called newline-delimited JSON](http://jsonlines.org/). As a
-consequence, a regular multi-line JSON file will most often fail.
+[JSON Lines text format, also called newline-delimited JSON](http://jsonlines.org/).
+
+For a regular multi-line JSON file, set the `wholeFile` option to `true`.
 
 {% include_example json_dataset scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
 </div>
@@ -901,8 +902,9 @@ or a JSON file.
 
 Note that the file that is offered as _a json file_ is not a typical JSON file. Each
 line must contain a separate, self-contained valid JSON object. For more information, please see
-[JSON Lines text format, also called newline-delimited JSON](http://jsonlines.org/). As a
-consequence, a regular multi-line JSON file will most often fail.
+[JSON Lines text format, also called newline-delimited JSON](http://jsonlines.org/).
+
+For a regular multi-line JSON file, set the `wholeFile` option to `true`.
 
 {% include_example json_dataset java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
 </div>
@@ -913,8 +915,9 @@ This conversion can be done using `SparkSession.read.json` on a JSON file.
 
 Note that the file that is offered as _a json file_ is not a typical JSON file. Each
 line must contain a separate, self-contained valid JSON object. For more information, please see
-[JSON Lines text format, also called newline-delimited JSON](http://jsonlines.org/). As a
-consequence, a regular multi-line JSON file will most often fail.
+[JSON Lines text format, also called newline-delimited JSON](http://jsonlines.org/).
+
+For a regular multi-line JSON file, set the `wholeFile` parameter to `True`.
 
 {% include_example json_dataset python/sql/datasource.py %}
 </div>
@@ -926,8 +929,9 @@ files is a JSON object.
 
 Note that the file that is offered as _a json file_ is not a typical JSON file. Each
 line must contain a separate, self-contained valid JSON object. For more information, please see
-[JSON Lines text format, also called newline-delimited JSON](http://jsonlines.org/). As a
-consequence, a regular multi-line JSON file will most often fail.
+[JSON Lines text format, also called newline-delimited JSON](http://jsonlines.org/).
+
+For a regular multi-line JSON file, set a named parameter `wholeFile` to `TRUE`.
 
 {% include_example json_dataset r/RSparkSQLExample.R %}
 
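To make the guide's distinction concrete, a short sketch in Scala (assumes an existing `SparkSession` named `spark`; the file contents in the comments are hypothetical):

```scala
// JSON Lines (the default): each line is a self-contained object, e.g.
//   {"name": "Alice", "age": 30}
//   {"name": "Bob", "age": 25}
val peopleDF = spark.read.json("examples/people.jsonl")

// A regular multi-line JSON file: the whole file is one record, e.g.
//   {
//     "name": "Alice",
//     "age": 30
//   }
// Read line by line (the default), such a file does not parse cleanly;
// in PERMISSIVE mode the rows typically land in a _corrupt_record column.
val personDF = spark.read.option("wholeFile", true).json("examples/person.json")
```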
4 changes: 2 additions & 2 deletions python/pyspark/sql/readwriter.py
@@ -163,8 +163,8 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
         """
         Loads a JSON file and returns the results as a :class:`DataFrame`.
 
-        Both JSON (one record per file) and `JSON Lines <http://jsonlines.org/>`_
-        (newline-delimited JSON) are supported and can be selected with the `wholeFile` parameter.
+        `JSON Lines <http://jsonlines.org/>`_(newline-delimited JSON) is supported by default.
+        For JSON (one record per file), set the `wholeFile` parameter to ``true``.
 
         If the ``schema`` parameter is not specified, this function goes
         through the input once to determine the input schema.
4 changes: 2 additions & 2 deletions python/pyspark/sql/streaming.py
@@ -433,8 +433,8 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
         """
         Loads a JSON file stream and returns the results as a :class:`DataFrame`.
 
-        Both JSON (one record per file) and `JSON Lines <http://jsonlines.org/>`_
-        (newline-delimited JSON) are supported and can be selected with the `wholeFile` parameter.
+        `JSON Lines <http://jsonlines.org/>`_(newline-delimited JSON) is supported by default.
+        For JSON (one record per file), set the `wholeFile` parameter to ``true``.
 
         If the ``schema`` parameter is not specified, this function goes
         through the input once to determine the input schema.
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -263,8 +263,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
   /**
    * Loads a JSON file and returns the results as a `DataFrame`.
    *
-   * Both JSON (one record per file) and <a href="http://jsonlines.org/">JSON Lines</a>
-   * (newline-delimited JSON) are supported and can be selected with the `wholeFile` option.
+   * <a href="http://jsonlines.org/">JSON Lines</a> (newline-delimited JSON) is supported by
+   * default. For JSON (one record per file), set the `wholeFile` option to true.
    *
    * This function goes through the input once to determine the input schema. If you know the
    * schema in advance, use the version that specifies the schema to avoid the extra scan.
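A usage sketch for the batch reader documented above (assumes an existing `SparkSession` named `spark`; the glob path and explicit `mode` option are illustrative, not part of this patch):

```scala
import org.apache.spark.sql.DataFrame

// Read a directory of files, each holding a single JSON record.
val reports: DataFrame = spark.read
  .option("wholeFile", true)      // opt out of the JSON Lines default
  .option("mode", "PERMISSIVE")   // route malformed records to _corrupt_record
  .json("/data/reports/*.json")

reports.printSchema()
reports.show()
```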
sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala
@@ -143,8 +143,8 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Logging {
   /**
    * Loads a JSON file stream and returns the results as a `DataFrame`.
    *
-   * Both JSON (one record per file) and <a href="http://jsonlines.org/">JSON Lines</a>
-   * (newline-delimited JSON) are supported and can be selected with the `wholeFile` option.
+   * <a href="http://jsonlines.org/">JSON Lines</a> (newline-delimited JSON) is supported by
+   * default. For JSON (one record per file), set the `wholeFile` option to true.
    *
    * This function goes through the input once to determine the input schema. If you know the
    * schema in advance, use the version that specifies the schema to avoid the extra scan.
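And the streaming counterpart (again a sketch with hypothetical paths; streaming file sources require a user-supplied schema unless schema inference is explicitly enabled):

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Structured Streaming needs the schema up front for file sources.
val schema = new StructType()
  .add("name", StringType)
  .add("age", IntegerType)

val stream = spark.readStream
  .schema(schema)
  .option("wholeFile", true)   // treat each arriving file as one JSON record
  .json("/data/incoming/")

// Echo parsed records to the console as new files arrive.
val query = stream.writeStream
  .format("console")
  .start()
query.awaitTermination()
```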