Merged
Changes from 1 commit
1347 commits
645c3a8
[SPARK-13423][HOTFIX] Static analysis fixes for 2.x / fixed for Scala…
srowen Mar 3, 2016
70f6f96
[SPARK-13013][DOCS] Replace example code in mllib-clustering.md using…
keypointt Mar 3, 2016
9a48c65
[SPARK-13599][BUILD] remove transitive groovy dependencies from Hive
steveloughran Mar 3, 2016
511d492
[SPARK-12877][ML] Add train-validation-split to pyspark
JeremyNixon Mar 3, 2016
cf95d72
[SPARK-13543][SQL] Support for specifying compression codec for Parqu…
HyukjinKwon Mar 3, 2016
ce58e99
[MINOR][ML][DOC] Remove duplicated periods at the end of some sharedP…
yanboliang Mar 3, 2016
52035d1
[SPARK-13423][HOTFIX] Static analysis fixes for 2.x / fixed for Scala…
srowen Mar 3, 2016
941b270
[MINOR] Fix typos in comments and testcase name of code
dongjoon-hyun Mar 3, 2016
3edcc40
[SPARK-13632][SQL] Move commands.scala to command package
Mar 3, 2016
ad0de99
[SPARK-13584][SQL][TESTS] Make ContinuousQueryManagerSuite not output…
zsxwing Mar 3, 2016
b373a88
[SPARK-13415][SQL] Visualize subquery in SQL web UI
Mar 4, 2016
d062587
[SPARK-13601] [TESTS] use 1 partition in tests to avoid race conditions
Mar 4, 2016
15d57f9
[SPARK-13647] [SQL] also check if numeric value is within allowed ran…
cloud-fan Mar 4, 2016
f6ac7c3
[SPARK-12941][SQL][MASTER] Spark-SQL JDBC Oracle dialect fails to map…
thomastechs Mar 4, 2016
465c665
[SPARK-13652][CORE] Copy ByteBuffer in sendRpcSync as it will be recy…
zsxwing Mar 4, 2016
dd83c20
[SPARK-13603][SQL] support SQL generation for subquery
Mar 4, 2016
27e88fa
[SPARK-13646][MLLIB] QuantileDiscretizer counts dataset twice in get…
eliasah Mar 4, 2016
c04dc27
[SPARK-13398][STREAMING] Move away from thread pool task support to f…
holdenk Mar 4, 2016
204b02b
[SPARK-12925] Improve HiveInspectors.unwrap for StringObjectInspector.…
rbalamohan Mar 4, 2016
e617508
[SPARK-13673][WINDOWS] Fixed not to pollute environment variables.
tsudukim Mar 4, 2016
c8f2545
[SPARK-13676] Fix mismatched default values for regParam in LogisticR…
dongjoon-hyun Mar 4, 2016
83302c3
[SPARK-13036][SPARK-13318][SPARK-13319] Add save/load for feature.py
yinxusen Mar 4, 2016
b7d4147
[SPARK-13633][SQL] Move things into catalyst.parser package
Mar 4, 2016
5f42c28
[SPARK-13459][WEB UI] Separate Alive and Dead Executors in Executor T…
ajbozarth Mar 4, 2016
a6e2bd3
[SPARK-13255] [SQL] Update vectorized reader to directly return Colum…
nongli Mar 4, 2016
f19228e
[SPARK-12073][STREAMING] backpressure rate controller consumes events…
Mar 5, 2016
adce5ee
[SPARK-12720][SQL] SQL Generation Support for Cube, Rollup, and Group…
gatorsmile Mar 5, 2016
8290004
[SPARK-13693][STREAMING][TESTS] Stop StreamingContext before deleting…
zsxwing Mar 5, 2016
8ff8809
Revert "[SPARK-13616][SQL] Let SQLBuilder convert logical plan withou…
liancheng Mar 6, 2016
ee913e6
[SPARK-13697] [PYSPARK] Fix the missing module name of TransformFunct…
zsxwing Mar 6, 2016
bc7a3ec
[SPARK-13685][SQL] Rename catalog.Catalog to ExternalCatalog
Mar 7, 2016
4b13896
[SPARK-13705][DOCS] UpdateStateByKey Operation documentation incorrec…
Mar 7, 2016
03f57a6
Fixing the type of the sentiment happiness value
heliocentrist Mar 7, 2016
d7eac9d
[SPARK-13651] Generator outputs are not resolved correctly resulting …
dilipbiswal Mar 7, 2016
4896411
[SPARK-13694][SQL] QueryPlan.expressions should always include all ex…
cloud-fan Mar 7, 2016
ef77003
[SPARK-13495][SQL] Add Null Filters in the query plan for Filters/Joi…
sameeragarwal Mar 7, 2016
e72914f
[SPARK-12243][BUILD][PYTHON] PySpark tests are slow in Jenkins.
dongjoon-hyun Mar 7, 2016
a3ec50a
[MINOR][DOC] improve the doc for "spark.memory.offHeap.size"
CodingCat Mar 7, 2016
b6071a7
[SPARK-13722][SQL] No Push Down for Non-deterministics Predicates thr…
gatorsmile Mar 7, 2016
e9e67b3
[SPARK-13655] Improve isolation between tests in KinesisBackedBlockRD…
JoshRosen Mar 7, 2016
e1fb857
[SPARK-529][CORE][YARN] Add type-safe config keys to SparkConf.
Mar 7, 2016
8577260
[SPARK-13442][SQL] Make type inference recognize boolean types
HyukjinKwon Mar 7, 2016
0eea12a
[SPARK-13596][BUILD] Move misc top-level build files into appropriate…
srowen Mar 7, 2016
e720dda
[SPARK-13665][SQL] Separate the concerns of HadoopFsRelation
marmbrus Mar 7, 2016
46f25c2
[SPARK-13648] Add Hive Cli to classes for isolated classloader
preecet Mar 7, 2016
da7bfac
[SPARK-13689][SQL] Move helper things in CatalystQl to new utils object
Mar 8, 2016
25bba58
[SPARK-13404] [SQL] Create variables for input row when it's actually…
Mar 8, 2016
017cdf2
[SPARK-13711][CORE] Don't call SparkUncaughtExceptionHandler in AppCl…
zsxwing Mar 8, 2016
e52e597
[SPARK-13659] Refactor BlockStore put*() APIs to remove returnValues
JoshRosen Mar 8, 2016
7771c73
[HOT-FIX][BUILD] Use the new location of `checkstyle-suppressions.xml`
dongjoon-hyun Mar 8, 2016
9bf76dd
[SPARK-13117][WEB UI] WebUI should use the local ip not 0.0.0.0
Mar 8, 2016
9e86e6e
[SPARK-13675][UI] Fix wrong historyserver url link for application ru…
jerryshao Mar 8, 2016
7d05d02
[SPARK-13637][SQL] use more information to simplify the code in Expan…
cloud-fan Mar 8, 2016
ca1a7b9
[HOTFIX][YARN] Fix yarn cluster mode fire and forget regression
jerryshao Mar 8, 2016
54040f8
[SPARK-13715][MLLIB] Remove last usages of jblas in tests
srowen Mar 8, 2016
78d3b60
[SPARK-13657] [SQL] Support parsing very long AND/OR expressions
Mar 8, 2016
ad3c9a9
[SPARK-13695] Don't cache MEMORY_AND_DISK blocks as bytes in memory a…
JoshRosen Mar 8, 2016
46881b4
[SPARK-12727][SQL] support SQL generation for aggregate with multi-di…
cloud-fan Mar 8, 2016
9740954
[ML] testEstimatorAndModelReadWrite should call checkModelData
yanboliang Mar 8, 2016
d5ce617
[SPARK-13740][SQL] add null check for _verify_type in types.py
cloud-fan Mar 8, 2016
d57daf1
[SPARK-13593] [SQL] improve the `createDataFrame` to accept data type…
cloud-fan Mar 8, 2016
076009b
[SPARK-13400] Stop using deprecated Octal escape literals
dongjoon-hyun Mar 8, 2016
1e28840
[SPARK-13738][SQL] Cleanup Data Source resolution
marmbrus Mar 8, 2016
e430614
[SPARK-13668][SQL] Reorder filter/join predicates to short-circuit is…
sameeragarwal Mar 8, 2016
81f54ac
[SPARK-13755] Escape quotes in SQL plan visualization node labels
JoshRosen Mar 9, 2016
d8813fa
[SPARK-13625][PYSPARK][ML] Added a check to see if an attribute is a …
BryanCutler Mar 9, 2016
982ef2b
[SPARK-13750][SQL] fix sizeInBytes of HadoopFsRelation
Mar 9, 2016
cc4ab37
[SPARK-13754] Keep old data source name for backwards compatibility
falaki Mar 9, 2016
035d3ac
[SPARK-7286][SQL] Deprecate !== in favour of =!=
jodersky Mar 9, 2016
f3201ae
[SPARK-13692][CORE][SQL] Fix trivial Coverity/Checkstyle defects
dongjoon-hyun Mar 9, 2016
2c5af7d
[SPARK-13640][SQL] Synchronize ScalaReflection.mirror method.
ueshin Mar 9, 2016
cbff280
[SPARK-13631][CORE] Thread-safe getLocationsWithLargestOutputs
Mar 9, 2016
c3689bc
[SPARK-13702][CORE][SQL][MLLIB] Use diamond operator for generic inst…
dongjoon-hyun Mar 9, 2016
8e8633e
[SPARK-13769][CORE] Update Java Doc in Spark Submit
Mar 9, 2016
53ba6d6
[SPARK-13698][SQL] Fix Analysis Exceptions when Using Backticks in Ge…
dilipbiswal Mar 9, 2016
9634e17
[SPARK-13242] [SQL] codegen fallback in case-when if there many branches
Mar 9, 2016
7791d0c
Revert "[SPARK-13668][SQL] Reorder filter/join predicates to short-ci…
davies Mar 9, 2016
256704c
[SPARK-13595][BUILD] Move docker, extras modules into external
srowen Mar 9, 2016
23369c3
[SPARK-13763][SQL] Remove Project when its Child's Output is Nil
gatorsmile Mar 9, 2016
cad29a4
[SPARK-13728][SQL] Fix ORC PPD test so that pushed filters can be che…
HyukjinKwon Mar 9, 2016
0dd0648
[SPARK-13615][ML] GeneralizedLinearRegression supports save/load
yanboliang Mar 9, 2016
3dc9ae2
[SPARK-13523] [SQL] Reuse exchanges in a query
Mar 9, 2016
c6aa356
[SPARK-13527][SQL] Prune Filters based on Constraints
gatorsmile Mar 9, 2016
e1772d3
[SPARK-11861][ML] Add feature importances for decision trees
sethah Mar 9, 2016
dbf2a7c
[SPARK-13781][SQL] Use ExpressionSets in ConstraintPropagationSuite
sameeragarwal Mar 9, 2016
37fcda3
[SPARK-13747][SQL] Fix concurrent query with fork-join pool
Mar 10, 2016
40e0676
[SPARK-13778][CORE] Set the executor state for a worker when removing it
zsxwing Mar 10, 2016
238447d
[SPARK-13775] History page sorted by completed time desc by default.
Mar 10, 2016
5f7dbdb
[MINOR] Fix typo in 'hypot' docstring
tristanreid Mar 10, 2016
a4a0add
[SPARK-13492][MESOS] Configurable Mesos framework webui URL.
Mar 10, 2016
926e9c4
[SPARK-13760][SQL] Fix BigDecimal constructor for FloatType
sameeragarwal Mar 10, 2016
7906461
Revert "[SPARK-13760][SQL] Fix BigDecimal constructor for FloatType"
yhuai Mar 10, 2016
aa0eba2
[SPARK-13766][SQL] Consistent file extensions for files written by in…
HyukjinKwon Mar 10, 2016
8a3acb7
[SPARK-13794][SQL] Rename DataFrameWriter.stream() DataFrameWriter.st…
rxin Mar 10, 2016
8bcad28
[SPARK-7420][STREAMING][TESTS] Enable test: o.a.s.streaming.JobGenera…
lw-lin Mar 10, 2016
3e3c3d5
[SPARK-13706][ML] Add Python Example for Train Validation Split
JeremyNixon Mar 10, 2016
9525c56
[MINOR][SQL] Replace DataFrameWriter.stream() with startStream() in c…
dongjoon-hyun Mar 10, 2016
9fe38ab
[SPARK-11108][ML] OneHotEncoder should support other numeric types
sethah Mar 10, 2016
927e22e
[SPARK-13663][CORE] Upgrade Snappy Java to 1.1.2.1
srowen Mar 10, 2016
74267be
[SPARK-13758][STREAMING][CORE] enhance exception message to avoid mis…
wei-mao-intel Mar 10, 2016
d24801a
[SPARK-13636] [SQL] Directly consume UnsafeRow in wholestage codegen …
viirya Mar 10, 2016
235f4ac
[SPARK-13727][CORE] SparkConf.contains does not consider deprecated keys
Mar 10, 2016
19f4ac6
[SPARK-13759][SQL] Add IsNotNull constraints for expressions with an …
sameeragarwal Mar 10, 2016
747d2f5
[SPARK-13790] Speed up ColumnVector's getDecimal
nongli Mar 10, 2016
3d2b6f5
[SQL][TEST] Increased timeouts to reduce flakiness in ContinuousQuery…
tdas Mar 10, 2016
81d4853
[SPARK-13696] Remove BlockStore class & simplify interfaces of mem. &…
JoshRosen Mar 10, 2016
91fed8e
[SPARK-3854][BUILD] Scala style: require spaces before `{`.
dongjoon-hyun Mar 10, 2016
020ff8c
[SPARK-13751] [SQL] generate better code for Filter
Mar 11, 2016
27fe6ba
[SPARK-13604][CORE] Sync worker's state after registering with master
zsxwing Mar 11, 2016
1d54278
[SPARK-13244][SQL] Migrates DataFrame to Dataset
liancheng Mar 11, 2016
88fa866
[MINOR][DOC] Fix supported hive version in doc
dongjoon-hyun Mar 11, 2016
416e71a
[SPARK-13327][SPARKR] Added parameter validations for colnames<-
Mar 11, 2016
c3a6269
[SPARK-13789] Infer additional constraints from attribute equality
sameeragarwal Mar 11, 2016
4d535d1
[SPARK-13389][SPARKR] SparkR support first/last with ignore NAs
yanboliang Mar 11, 2016
560489f
[SPARK-13732][SPARK-13797][SQL] Remove projectList from Window and El…
gatorsmile Mar 11, 2016
6871cc8
[SPARK-12718][SPARK-13720][SQL] SQL generation support for window fun…
cloud-fan Mar 11, 2016
74c4e26
[HOT-FIX] fix compile
cloud-fan Mar 11, 2016
e33bc67
[MINOR][CORE] Fix a duplicate "and" in a log message.
Mar 11, 2016
d18276c
[SPARK-13672][ML] Add python examples of BisectingKMeans in ML and MLLIB
zhengruifeng Mar 11, 2016
6ca990f
[SPARK-13294][PROJECT INFRA] Remove MiMa's dependency on spark-class …
JoshRosen Mar 11, 2016
0b713e0
[SPARK-13512][ML] add example and doc for MaxAbsScaler
hhbyyh Mar 11, 2016
234f781
[SPARK-13787][ML][PYSPARK] Pyspark feature importances for decision t…
sethah Mar 11, 2016
8fff0f9
[HOT-FIX][SQL][ML] Fix compile error from use of DataFrame in Java Ma…
MLnick Mar 11, 2016
07f1c54
[SPARK-13577][YARN] Allow Spark jar to be multiple jars, archive.
Mar 11, 2016
6d37e1e
[SPARK-13817][BUILD][SQL] Re-enable MiMA and removes object DataFrame
liancheng Mar 11, 2016
99b7187
[SPARK-13780][SQL] Add missing dependency to build.
Mar 11, 2016
eb650a8
[STREAMING][MINOR] Fix a duplicate "be" in comments
lw-lin Mar 11, 2016
ff776b2
[SPARK-13328][CORE] Poor read performance for broadcast variables wit…
nezihyigitbasi Mar 11, 2016
073bf9d
[SPARK-13807] De-duplicate `Python*Helper` instantiation code in PySp…
JoshRosen Mar 11, 2016
42afd72
[SPARK-13814] [PYSPARK] Delete unnecessary imports in python examples…
zhengruifeng Mar 11, 2016
66d9d0e
[SPARK-13139][SQL] Parse Hive DDL commands ourselves
Mar 11, 2016
2ef4c59
[SPARK-13830] prefer block manager than direct result for large result
Mar 11, 2016
ba8c86d
[SPARK-13671] [SPARK-13311] [SQL] Use different physical plans for RD…
Mar 12, 2016
4eace4d
[SPARK-13828][SQL] Bring back stack trace of AnalysisException thrown…
liancheng Mar 12, 2016
c079420
[SPARK-13841][SQL] Removes Dataset.collectRows()/takeRows()
liancheng Mar 13, 2016
db88d02
[MINOR][DOCS] Replace `DataFrame` with `Dataset` in Javadoc.
dongjoon-hyun Mar 13, 2016
515e4af
[SPARK-13810][CORE] Add Port Configuration Suggestions on Bind Except…
bjornjon Mar 13, 2016
c7e68c3
[SPARK-13812][SPARKR] Fix SparkR lint-r test errors.
Mar 13, 2016
f3daa09
[SQL] fix typo in DataSourceRegister
Mar 14, 2016
473263f
[SPARK-13834][BUILD] Update sbt and sbt plugins for 2.x.
dongjoon-hyun Mar 14, 2016
1840852
[SPARK-13823][CORE][STREAMING][SQL] Always specify Charset in String …
srowen Mar 14, 2016
e58fa19
Closes #11668
rxin Mar 14, 2016
acdf219
[MINOR][DOCS] Fix more typos in comments/strings.
dongjoon-hyun Mar 14, 2016
31d069d
[SPARK-13746][TESTS] stop using deprecated SynchronizedSet
Mar 14, 2016
250832c
[SPARK-13207][SQL] Make partitioning discovery ignore _SUCCESS files.
yhuai Mar 14, 2016
9a1680c
[SPARK-13139][SQL] Follow-ups to #11573
Mar 14, 2016
9a87afd
[SPARK-13833] Guard against race condition when re-caching disk block…
JoshRosen Mar 14, 2016
45f8053
[SPARK-13578][CORE] Modify launch scripts to not use assemblies.
Mar 14, 2016
63f642a
[SPARK-13779][YARN] Avoid cancelling non-local container requests.
rdblue Mar 14, 2016
6a4bfcd
[SPARK-13658][SQL] BooleanSimplification rule is slow with large bool…
viirya Mar 14, 2016
07cb323
[SPARK-13848][SPARK-5185] Update to Py4J 0.9.2 in order to fix classl…
JoshRosen Mar 14, 2016
310981d
[SPARK-12583][MESOS] Mesos shuffle service: Don't delete shuffle file…
Mar 14, 2016
9f13f0f
[MINOR][DOCS] Added Missing back slashes
danielsan Mar 14, 2016
e06493c
[MINOR][COMMON] Fix copy-paste oversight in variable naming
bjornjon Mar 14, 2016
23385e8
[SPARK-13054] Always post TaskEnd event for tasks
Mar 14, 2016
a48296f
[SPARK-13686][MLLIB][STREAMING] Add a constructor parameter `reqParam…
dongjoon-hyun Mar 14, 2016
38529d8
[SPARK-10907][SPARK-6157] Remove pendingUnrollMemory from MemoryStore
JoshRosen Mar 14, 2016
8301fad
[SPARK-13626][CORE] Avoid duplicate config deprecation warnings.
Mar 14, 2016
06dec37
[SPARK-13843][STREAMING] Remove streaming-flume, streaming-mqtt, stre…
zsxwing Mar 14, 2016
992142b
[SPARK-11826][MLLIB] Refactor add() and subtract() methods
ehsanmok Mar 15, 2016
17eec0a
[SPARK-13664][SQL] Add a strategy for planning partitioned and bucket…
marmbrus Mar 15, 2016
4bf4609
[SPARK-13882][SQL] Remove org.apache.spark.sql.execution.local
rxin Mar 15, 2016
8e0b030
[SPARK-10380][SQL] Fix confusing documentation examples for astype/dr…
rxin Mar 15, 2016
b5e3bd8
[SPARK-13791][SQL] Add MetadataLog and HDFSMetadataLog
zsxwing Mar 15, 2016
e76679a
[SPARK-13880][SPARK-13881][SQL] Rename DataFrame.scala Dataset.scala,…
rxin Mar 15, 2016
9256840
[SPARK-13661][SQL] avoid the copy in HashedRelation
Mar 15, 2016
f72743d
[SPARK-13353][SQL] fast serialization for collecting DataFrame/Dataset
Mar 15, 2016
e649580
[SPARK-13884][SQL] Remove DescribeCommand's dependency on LogicalPlan
rxin Mar 15, 2016
43304b1
[SPARK-13888][DOC] Remove Akka Receiver doc and refer to the DStream …
zsxwing Mar 15, 2016
a51f877
[SPARK-13870][SQL] Add scalastyle escaping correctly in CVSSuite.scala
dongjoon-hyun Mar 15, 2016
276c2d5
[SPARK-13890][SQL] Remove some internal classes' dependency on SQLCon…
rxin Mar 15, 2016
99bd2f0
[SPARK-13840][SQL] Split Optimizer Rule ColumnPruning to ColumnPrunin…
gatorsmile Mar 15, 2016
10251a7
[SPARK-13660][SQL][TESTS] ContinuousQuerySuite floods the logs with g…
keypointt Mar 15, 2016
dafd70f
[SPARK-12379][ML][MLLIB] Copy GBT implementation to spark.ml
sethah Mar 15, 2016
bd5365b
[SPARK-13803] restore the changes in SPARK-3411
CodingCat Mar 15, 2016
48978ab
[SPARK-13576][BUILD] Don't create assembly for examples.
Mar 15, 2016
5e6f2f4
[SPARK-13893][SQL] Remove SQLContext.catalog/analyzer (internal method)
rxin Mar 15, 2016
d89c714
[SPARK-13642][YARN] Changed the default application exit state to fai…
jerryshao Mar 15, 2016
50e3644
[SPARK-13896][SQL][STRING] Dataset.toJSON should return Dataset
Mar 15, 2016
dddf2f2
[MINOR] a minor fix for the comments of a method in RPC Dispatcher
CodingCat Mar 15, 2016
41eaabf
[SPARK-13626][CORE] Revert change to SparkConf's constructor.
Mar 15, 2016
643649d
[SPARK-13895][SQL] DataFrameReader.text should return Dataset[String]
rxin Mar 15, 2016
bbd887f
[SPARK-13918][SQL] Merge SortMergeJoin and SortMergerOuterJoin
Mar 16, 2016
52b6a89
[MINOR][TEST][SQL] Remove wrong "expected" parameter in checkNaNWitho…
Mar 16, 2016
421f6c2
[SPARK-13917] [SQL] generate broadcast semi join
Mar 16, 2016
3665294
[SPARK-9837][ML] R-like summary statistics for GLMs via iteratively r…
yanboliang Mar 16, 2016
3c578c5
[SPARK-13920][BUILD] MIMA checks should apply to @Experimental and @D…
dongjoon-hyun Mar 16, 2016
9202479
[SPARK-13899][SQL] Produce InternalRow instead of external Row at CSV…
HyukjinKwon Mar 16, 2016
431a3d0
[SPARK-12653][SQL] Re-enable test "SPARK-8489: MissingRequirementErro…
dongjoon-hyun Mar 16, 2016
05ab294
[SPARK-13906] Ensure that there are at least 2 dispatcher threads.
yonran Mar 16, 2016
3b461d9
[SPARK-13823][SPARK-13397][SPARK-13395][CORE] More warnings, Standard…
srowen Mar 16, 2016
56d8824
[SPARK-13396] Stop using our internal deprecated .metrics on Exceptio…
GayathriMurali Mar 16, 2016
1d95fb6
[SPARK-13793][CORE] PipedRDD doesn't propagate exceptions while readi…
tejasapatil Mar 16, 2016
496d2a2
[SPARK-13889][YARN] Fix integer overflow when calculating the max num…
carsonwang Mar 16, 2016
9412547
[SPARK-13823][HOTFIX] Increase tryAcquire timeout and assert it succe…
srowen Mar 16, 2016
5f6bdf9
[SPARK-13281][CORE] Switch broadcast of RDD to exception from warning
Mar 16, 2016
eacd9d8
[SPARK-13360][PYSPARK][YARN] PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON…
zjffdu Mar 16, 2016
d9e8f26
[SPARK-13924][SQL] officially support multi-insert
cloud-fan Mar 16, 2016
d9670f8
[SPARK-13894][SQL] SqlContext.range return type from DataFrame to Dat…
chenghao-intel Mar 16, 2016
9198497
[SPARK-13816][GRAPHX] Add parameter checks for algorithms in Graphx
zhengruifeng Mar 16, 2016
1d1de28
[SPARK-13827][SQL] Can't add subquery to an operator with same-name o…
cloud-fan Mar 16, 2016
c4bd576
[SPARK-12721][SQL] SQL Generation for Script Transformation
gatorsmile Mar 16, 2016
ae6c677
[SPARK-13038][PYSPARK] Add load/save to pipeline
yinxusen Mar 16, 2016
3f06eb7
[SPARK-13613][ML] Provide ignored tests to export test dataset into C…
yanboliang Mar 16, 2016
6fc2b65
[SPARK-11888][ML] Decision tree persistence in spark.ml
jkbradley Mar 16, 2016
85c42fd
[SPARK-13927][MLLIB] add row/column iterator to local matrices
mengxr Mar 16, 2016
27e1f38
[SPARK-13034] PySpark ml.classification support export/import
GayathriMurali Mar 16, 2016
4ce2d24
[SPARK-13942][CORE][DOCS] Remove Shark-related docs for 2.x
dongjoon-hyun Mar 16, 2016
b90c020
[SPARK-13922][SQL] Filter rows with null attributes in vectorized par…
sameeragarwal Mar 16, 2016
f96997b
[SPARK-13871][SQL] Support for inferring filters from data constraints
sameeragarwal Mar 16, 2016
77ba302
[SPARK-13869][SQL] Remove redundant conditions while combining filters
sameeragarwal Mar 16, 2016
d4d8493
[SPARK-11011][SQL] Narrow type of UDT serialization
jodersky Mar 16, 2016
92b7057
[SPARK-13761][ML] Deprecate validateParams
hhbyyh Mar 17, 2016
ca9ef86
[SPARK-13923][SQL] Implement SessionCatalog
Mar 17, 2016
917f400
[SPARK-13719][SQL] Parse JSON rows having an array type and a struct …
HyukjinKwon Mar 17, 2016
c100d31
[SPARK-13873] [SQL] Avoid copy of UnsafeRow when there is no join in …
Mar 17, 2016
7eef246
[SPARK-13118][SQL] Expression encoding for optional synthetic classes
jodersky Mar 17, 2016
c890c35
[MINOR][SQL][BUILD] Remove duplicated lines
dongjoon-hyun Mar 17, 2016
d1c193a
[SPARK-12855][MINOR][SQL][DOC][TEST] remove spark.sql.dialect from do…
adrian-wang Mar 17, 2016
de1a84e
[SPARK-13926] Automatically use Kryo serializer when shuffling RDDs w…
JoshRosen Mar 17, 2016
5faba9f
[SPARK-13403][SQL] Pass hadoopConfiguration to HiveConf constructors.
rdblue Mar 17, 2016
82066a1
[SPARK-13948] MiMa check should catch if the visibility changes to pr…
JoshRosen Mar 17, 2016
30c1884
Revert "[SPARK-13840][SQL] Split Optimizer Rule ColumnPruning to Colu…
davies Mar 17, 2016
204c9de
[MINOR][DOC] Add JavaStreamingTestExample
zhengruifeng Mar 17, 2016
357d82d
[SPARK-13629][ML] Add binary toggle Param to CountVectorizer
hhbyyh Mar 17, 2016
ea9ca6f
[SPARK-13901][CORE] correct the logDebug information when jump to the…
trueyao Mar 17, 2016
8ef3399
[SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.int…
cloud-fan Mar 17, 2016
1974d1d
[SPARK-12719][SQL] SQL generation support for Generate
cloud-fan Mar 17, 2016
65b75e6
[SPARK-13776][WEBUI] Limit the max number of acceptors and selectors …
zsxwing Mar 17, 2016
637a78f
[SPARK-13427][SQL] Support USING clause in JOIN.
dilipbiswal Mar 17, 2016
5f3bda6
[SPARK-13838] [SQL] Clear variable code to prevent it to be re-evalua…
viirya Mar 17, 2016
3ee7996
[SPARK-12719][HOTFIX] Fix compilation against Scala 2.10
tedyu Mar 17, 2016
828213d
[SPARK-13937][PYSPARK][ML] Change JavaWrapper _java_obj from static t…
BryanCutler Mar 17, 2016
edf8b87
[SPARK-11891] Model export/import for RFormula and RFormulaModel
yinxusen Mar 17, 2016
4c08e2c
Revert "[SPARK-12719][HOTFIX] Fix compilation against Scala 2.10"
yhuai Mar 17, 2016
b39e80d
[SPARK-13761][ML] Remove remaining uses of validateParams
jkbradley Mar 17, 2016
1614485
[SPARK-10788][MLLIB][ML] Remove duplicate bins for decision trees
sethah Mar 17, 2016
453455c
[SPARK-13974][SQL] sub-query names do not need to be globally unique …
cloud-fan Mar 18, 2016
6037ed0
[SPARK-13976][SQL] do not remove sub-queries added by user when gener…
cloud-fan Mar 18, 2016
6c2d894
[SPARK-13921] Store serialized blocks as multiple chunks in MemoryStore
JoshRosen Mar 18, 2016
90a1d8d
[SPARK-12719][HOTFIX] Fix compilation against Scala 2.10
tedyu Mar 18, 2016
10ef4f3
[SPARK-13826][SQL] Revises Dataset ScalaDoc
liancheng Mar 18, 2016
750ed64
[SPARK-13930] [SQL] Apply fast serialization on collect limit operator
viirya Mar 18, 2016
bb1fda0
[SPARK-13826][SQL] Addendum: update documentation for Datasets
rxin Mar 18, 2016
7783b6f
[MINOR][ML] When trainingSummary is None, it should throw RuntimeExce…
yanboliang Mar 18, 2016
0f1015f
[SPARK-14001][SQL] support multi-children Union in SQLBuilder
cloud-fan Mar 18, 2016
53f32a2
[MINOR][DOC] Fix nits in JavaStreamingTestExample
zhengruifeng Mar 18, 2016
0acb32a
[SPARK-13972][SQ] hive tests should fail if SQL generation failed
cloud-fan Mar 18, 2016
14c7236
[SPARK-14004][SQL][MINOR] AttributeReference and Alias should only us…
liancheng Mar 18, 2016
9c23c81
[SPARK-13977] [SQL] Brings back Shuffled hash join
Mar 18, 2016
[SPARK-13593] [SQL] improve the `createDataFrame` to accept data type string and verify the data

## What changes were proposed in this pull request?

This PR improves the `createDataFrame` method so that it also accepts a datatype string; users can then convert a Python RDD to a DataFrame easily, for example `df = rdd.toDF("a: int, b: string")`.
It also supports flat schemas, so an RDD of ints can be converted to a DataFrame directly; each int is automatically wrapped into a row.
If a schema is given, the real data is now checked against it, and an error is thrown on mismatch.
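
A quick sketch of the new usage, mirroring the doctests added below (assumes a live `SparkContext`/`SQLContext` named `sc` and `sqlContext`):

```python
rdd = sc.parallelize([("Alice", 1)])

# Schema given as a datatype string: comma-separated "name: type" pairs;
# the top-level struct<> wrapper can be omitted.
sqlContext.createDataFrame(rdd, "a: string, b: int").collect()
# [Row(a=u'Alice', b=1)]

# Flat schema: an RDD of plain ints; each value is wrapped into a row
# with the single field name "value".
ints = rdd.map(lambda row: row[1])
sqlContext.createDataFrame(ints, "int").collect()
# [Row(value=1)]

# The data is verified against the schema, so a mismatch ("boolean" vs.
# actual ints) now raises at runtime, per the doctest below.
sqlContext.createDataFrame(ints, "boolean").collect()  # raises Py4JJavaError
```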

## How was this patch tested?

new tests in `tests.py` and doctests in `types.py`

Author: Wenchen Fan <[email protected]>

Closes apache#11444 from cloud-fan/pyrdd.
cloud-fan authored and davies committed Mar 8, 2016
commit d57daf1f7732a7ac54a91fe112deeda0a254f9ef
68 changes: 54 additions & 14 deletions python/pyspark/sql/context.py
@@ -31,8 +31,8 @@
from pyspark import since
from pyspark.rdd import RDD, ignore_unicode_prefix
from pyspark.serializers import AutoBatchedSerializer, PickleSerializer
from pyspark.sql.types import Row, StringType, StructType, _verify_type, \
_infer_schema, _has_nulltype, _merge_type, _create_converter
from pyspark.sql.types import Row, DataType, StringType, StructType, _verify_type, \
_infer_schema, _has_nulltype, _merge_type, _create_converter, _parse_datatype_string
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.readwriter import DataFrameReader
from pyspark.sql.utils import install_exception_handler
@@ -301,11 +301,6 @@ def _createFromLocal(self, data, schema):
Create an RDD for DataFrame from a list or pandas.DataFrame, returns
the RDD and schema.
"""
if has_pandas and isinstance(data, pandas.DataFrame):
if schema is None:
schema = [str(x) for x in data.columns]
data = [r.tolist() for r in data.to_records(index=False)]

# make sure data can be consumed multiple times
if not isinstance(data, list):
data = list(data)
@@ -333,8 +328,7 @@ def _createFromLocal(self, data, schema):
@ignore_unicode_prefix
def createDataFrame(self, data, schema=None, samplingRatio=None):
"""
Creates a :class:`DataFrame` from an :class:`RDD` of :class:`tuple`/:class:`list`,
list or :class:`pandas.DataFrame`.
Creates a :class:`DataFrame` from an :class:`RDD`, a list or a :class:`pandas.DataFrame`.

When ``schema`` is a list of column names, the type of each column
will be inferred from ``data``.
@@ -343,15 +337,29 @@ def createDataFrame(self, data, schema=None, samplingRatio=None):
from ``data``, which should be an RDD of :class:`Row`,
or :class:`namedtuple`, or :class:`dict`.

When ``schema`` is a :class:`DataType` or a datatype string, it must match the real data, or
an exception will be thrown at runtime. If the given schema is not a StructType, it will be
wrapped into a StructType as its only field, and the field name will be "value". Each record
will also be wrapped into a tuple, which can be converted to a row later.

If schema inference is needed, ``samplingRatio`` is used to determine the ratio of
rows used for schema inference. The first row will be used if ``samplingRatio`` is ``None``.

:param data: an RDD of :class:`Row`/:class:`tuple`/:class:`list`/:class:`dict`,
:class:`list`, or :class:`pandas.DataFrame`.
:param schema: a :class:`StructType` or list of column names. default None.
:param data: an RDD of any kind of SQL data representation (e.g. row, tuple, int, boolean,
etc.), or :class:`list`, or :class:`pandas.DataFrame`.
:param schema: a :class:`DataType` or a datatype string or a list of column names, default
is None. The data type string format is the same as `DataType.simpleString`, except that
the top-level struct type can omit the `struct<>` and atomic types use `typeName()` as
their format, e.g. use `byte` instead of `tinyint` for ByteType. `int` can also be used
as a short name for IntegerType.
:param samplingRatio: the sample ratio of rows used for inferring
:return: :class:`DataFrame`

.. versionchanged:: 2.0
The schema parameter can be a DataType or a datatype string after 2.0. If it's not a
StructType, it will be wrapped into a StructType and each record will also be wrapped
into a tuple.

>>> l = [('Alice', 1)]
>>> sqlContext.createDataFrame(l).collect()
[Row(_1=u'Alice', _2=1)]
@@ -388,14 +396,46 @@ def createDataFrame(self, data, schema=None, samplingRatio=None):
[Row(name=u'Alice', age=1)]
>>> sqlContext.createDataFrame(pandas.DataFrame([[1, 2]])).collect() # doctest: +SKIP
[Row(0=1, 1=2)]

>>> sqlContext.createDataFrame(rdd, "a: string, b: int").collect()
[Row(a=u'Alice', b=1)]
>>> rdd = rdd.map(lambda row: row[1])
>>> sqlContext.createDataFrame(rdd, "int").collect()
[Row(value=1)]
>>> sqlContext.createDataFrame(rdd, "boolean").collect() # doctest: +IGNORE_EXCEPTION_DETAIL
Traceback (most recent call last):
...
Py4JJavaError:...
"""
if isinstance(data, DataFrame):
raise TypeError("data is already a DataFrame")

if isinstance(schema, basestring):
schema = _parse_datatype_string(schema)

if has_pandas and isinstance(data, pandas.DataFrame):
if schema is None:
schema = [str(x) for x in data.columns]
data = [r.tolist() for r in data.to_records(index=False)]

if isinstance(schema, StructType):
def prepare(obj):
_verify_type(obj, schema)
return obj
elif isinstance(schema, DataType):
datatype = schema

def prepare(obj):
_verify_type(obj, datatype)
return (obj, )
schema = StructType().add("value", datatype)
else:
prepare = lambda obj: obj

if isinstance(data, RDD):
rdd, schema = self._createFromRDD(data, schema, samplingRatio)
rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
else:
rdd, schema = self._createFromLocal(data, schema)
rdd, schema = self._createFromLocal(map(prepare, data), schema)
jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
df = DataFrame(jdf, self)
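
The practical effect of the `prepare` hook above: each record is verified against the declared schema on the Python side before being shipped to the JVM, so type mismatches surface with a clear message as soon as the DataFrame is materialized. A small illustration (assumes a live `sc`; the error text is abbreviated to match the regex asserted in the new tests below):

```python
from pyspark.sql import Row

data = [Row(key=i, value=str(i)) for i in range(100)]
rdd = sc.parallelize(data, 5)

# Compatible but different field types are fine: ints become strings.
rdd.toDF("key: string, value: string").first()
# Row(key=u'0', value=u'0')

# Incompatible types fail at collect time with a verification error.
rdd.toDF("key: float, value: string").collect()
# ... FloatType can not accept ...
```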
40 changes: 37 additions & 3 deletions python/pyspark/sql/tests.py
@@ -369,9 +369,7 @@ def test_create_dataframe_schema_mismatch(self):
rdd = self.sc.parallelize(range(3)).map(lambda i: Row(a=i))
schema = StructType([StructField("a", IntegerType()), StructField("b", StringType())])
df = self.sqlCtx.createDataFrame(rdd, schema)
message = ".*Input row doesn't have expected number of values required by the schema.*"
with self.assertRaisesRegexp(Exception, message):
df.show()
self.assertRaises(Exception, lambda: df.show())

def test_serialize_nested_array_and_map(self):
d = [Row(l=[Row(a=1, b='s')], d={"key": Row(c=1.0, d="2")})]
@@ -1178,6 +1176,42 @@ def test_functions_broadcast(self):
# planner should not crash without a join
broadcast(df1)._jdf.queryExecution().executedPlan()

def test_toDF_with_schema_string(self):
data = [Row(key=i, value=str(i)) for i in range(100)]
rdd = self.sc.parallelize(data, 5)

df = rdd.toDF("key: int, value: string")
self.assertEqual(df.schema.simpleString(), "struct<key:int,value:string>")
self.assertEqual(df.collect(), data)

# different but compatible field types can be used.
df = rdd.toDF("key: string, value: string")
self.assertEqual(df.schema.simpleString(), "struct<key:string,value:string>")
self.assertEqual(df.collect(), [Row(key=str(i), value=str(i)) for i in range(100)])

# field names can differ.
df = rdd.toDF(" a: int, b: string ")
self.assertEqual(df.schema.simpleString(), "struct<a:int,b:string>")
self.assertEqual(df.collect(), data)

# number of fields must match.
self.assertRaisesRegexp(Exception, "Length of object",
lambda: rdd.toDF("key: int").collect())

# field types mismatch will cause exception at runtime.
self.assertRaisesRegexp(Exception, "FloatType can not accept",
lambda: rdd.toDF("key: float, value: string").collect())

# flat schema values will be wrapped into row.
df = rdd.map(lambda row: row.key).toDF("int")
self.assertEqual(df.schema.simpleString(), "struct<value:int>")
self.assertEqual(df.collect(), [Row(key=i) for i in range(100)])

# users can use DataType directly instead of data type string.
df = rdd.map(lambda row: row.key).toDF(IntegerType())
self.assertEqual(df.schema.simpleString(), "struct<value:int>")
self.assertEqual(df.collect(), [Row(key=i) for i in range(100)])


class HiveContextSQLTests(ReusedPySparkTestCase):

129 changes: 123 additions & 6 deletions python/pyspark/sql/types.py
@@ -681,6 +681,129 @@ def __eq__(self, other):
for v in [ArrayType, MapType, StructType])


_FIXED_DECIMAL = re.compile("decimal\\(\\s*(\\d+)\\s*,\\s*(\\d+)\\s*\\)")


_BRACKETS = {'(': ')', '[': ']', '{': '}'}


def _parse_basic_datatype_string(s):
if s in _all_atomic_types.keys():
return _all_atomic_types[s]()
elif s == "int":
return IntegerType()
elif _FIXED_DECIMAL.match(s):
m = _FIXED_DECIMAL.match(s)
return DecimalType(int(m.group(1)), int(m.group(2)))
else:
raise ValueError("Could not parse datatype: %s" % s)


def _ignore_brackets_split(s, separator):
"""
Splits the given string by the given separator, but ignores separators inside bracket pairs, e.g.
given "a,b" and separator ",", it will return ["a", "b"], but given "a<b,c>, d", it will return
["a<b,c>", "d"].
"""
parts = []
buf = ""
level = 0
for c in s:
if c in _BRACKETS.keys():
level += 1
buf += c
elif c in _BRACKETS.values():
if level == 0:
raise ValueError("Brackets are not correctly paired: %s" % s)
level -= 1
buf += c
elif c == separator and level > 0:
buf += c
elif c == separator:
parts.append(buf)
buf = ""
else:
buf += c

if len(buf) == 0:
raise ValueError("The %s cannot be the last char: %s" % (separator, s))
parts.append(buf)
return parts


def _parse_struct_fields_string(s):
parts = _ignore_brackets_split(s, ",")
fields = []
for part in parts:
name_and_type = _ignore_brackets_split(part, ":")
if len(name_and_type) != 2:
raise ValueError("The strcut field string format is: 'field_name:field_type', " +
"but got: %s" % part)
field_name = name_and_type[0].strip()
field_type = _parse_datatype_string(name_and_type[1])
fields.append(StructField(field_name, field_type))
return StructType(fields)


def _parse_datatype_string(s):
"""
Parses the given data type string to a :class:`DataType`. The data type string format is the
same as `DataType.simpleString`, except that the top-level struct type can omit the `struct<>`
and atomic types use `typeName()` as their format, e.g. use `byte` instead of `tinyint` for
ByteType. `int` can also be used as a short name for IntegerType.

>>> _parse_datatype_string("int ")
IntegerType
>>> _parse_datatype_string("a: byte, b: decimal( 16 , 8 ) ")
StructType(List(StructField(a,ByteType,true),StructField(b,DecimalType(16,8),true)))
>>> _parse_datatype_string("a: array< short>")
StructType(List(StructField(a,ArrayType(ShortType,true),true)))
>>> _parse_datatype_string(" map<string , string > ")
MapType(StringType,StringType,true)

>>> # Error cases
>>> _parse_datatype_string("blabla") # doctest: +IGNORE_EXCEPTION_DETAIL
Traceback (most recent call last):
...
ValueError:...
>>> _parse_datatype_string("a: int,") # doctest: +IGNORE_EXCEPTION_DETAIL
Traceback (most recent call last):
...
ValueError:...
>>> _parse_datatype_string("array<int") # doctest: +IGNORE_EXCEPTION_DETAIL
Traceback (most recent call last):
...
ValueError:...
>>> _parse_datatype_string("map<int, boolean>>") # doctest: +IGNORE_EXCEPTION_DETAIL
Traceback (most recent call last):
...
ValueError:...
"""
s = s.strip()
if s.startswith("array<"):
if s[-1] != ">":
raise ValueError("'>' should be the last char, but got: %s" % s)
return ArrayType(_parse_datatype_string(s[6:-1]))
elif s.startswith("map<"):
if s[-1] != ">":
raise ValueError("'>' should be the last char, but got: %s" % s)
parts = _ignore_brackets_split(s[4:-1], ",")
if len(parts) != 2:
raise ValueError("The map type string format is: 'map<key_type,value_type>', " +
"but got: %s" % s)
kt = _parse_datatype_string(parts[0])
vt = _parse_datatype_string(parts[1])
return MapType(kt, vt)
elif s.startswith("struct<"):
if s[-1] != ">":
raise ValueError("'>' should be the last char, but got: %s" % s)
return _parse_struct_fields_string(s[7:-1])
elif ":" in s:
return _parse_struct_fields_string(s)
else:
return _parse_basic_datatype_string(s)


def _parse_datatype_json_string(json_string):
"""Parses the given data type JSON string.
>>> import pickle
@@ -730,9 +853,6 @@ def _parse_datatype_json_string(json_string):
return _parse_datatype_json_value(json.loads(json_string))


_FIXED_DECIMAL = re.compile("decimal\\((\\d+),(\\d+)\\)")


def _parse_datatype_json_value(json_value):
if not isinstance(json_value, dict):
if json_value in _all_atomic_types.keys():
@@ -940,9 +1060,6 @@ def convert_struct(obj):
return convert_struct


_BRACKETS = {'(': ')', '[': ']', '{': '}'}


def _split_schema_abstract(s):
"""
split the schema abstract into fields
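
To close the loop, a worked example of the bracket-aware splitting and the string parser above (both are private helpers in `pyspark.sql.types`; outputs follow the doctests):

```python
from pyspark.sql.types import _ignore_brackets_split, _parse_datatype_string

# The comma inside decimal(16, 8) sits at bracket level 1 (parentheses are
# in _BRACKETS), so the top-level split keeps it in the current buffer;
# whitespace is preserved here and stripped later by
# _parse_struct_fields_string.
_ignore_brackets_split("a: byte, b: decimal(16, 8)", ",")
# ['a: byte', ' b: decimal(16, 8)']

# The full string then parses into a two-field StructType.
_parse_datatype_string("a: byte, b: decimal(16, 8)")
# StructType(List(StructField(a,ByteType,true),StructField(b,DecimalType(16,8),true)))
```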