b14bfc3
[SPARK-19993][SQL] Caching logical plans containing subquery expressi…
dilipbiswal Apr 12, 2017
b938438
[MINOR][DOCS] Fix spacings in Structured Streaming Programming Guide
dongjinleekr Apr 12, 2017
bca4259
[MINOR][DOCS] JSON APIs related documentation fixes
HyukjinKwon Apr 12, 2017
044f7ec
[SPARK-20298][SPARKR][MINOR] fixed spelling mistake "charactor"
bdwyer2 Apr 12, 2017
ffc57b0
[SPARK-20302][SQL] Short circuit cast when from and to types are stru…
rxin Apr 12, 2017
2e1fd46
[SPARK-20296][TRIVIAL][DOCS] Count distinct error message for streaming
jtoka Apr 12, 2017
ceaf77a
[SPARK-18692][BUILD][DOCS] Test Java 8 unidoc build on Jenkins
HyukjinKwon Apr 12, 2017
504e62e
[SPARK-20303][SQL] Rename createTempFunction to registerFunction
gatorsmile Apr 12, 2017
5408553
[SPARK-20304][SQL] AssertNotNull should not include path in string re…
rxin Apr 12, 2017
99a9473
[SPARK-19570][PYSPARK] Allow to disable hive in pyspark shell
zjffdu Apr 12, 2017
924c424
[SPARK-20301][FLAKY-TEST] Fix Hadoop Shell.runCommand flakiness in St…
brkyvz Apr 12, 2017
a7b430b
[SPARK-15354][FLAKY-TEST] TopologyAwareBlockReplicationPolicyBehavior…
cloud-fan Apr 13, 2017
c5f1cc3
[SPARK-20131][CORE] Don't use `this` lock in StandaloneSchedulerBacke…
zsxwing Apr 13, 2017
ec68d8f
[SPARK-20189][DSTREAM] Fix spark kinesis testcases to remove deprecat…
yashs360 Apr 13, 2017
095d1cb
[SPARK-20265][MLLIB] Improve Prefix'span pre-processing efficiency
Syrux Apr 13, 2017
a4293c2
[SPARK-20284][CORE] Make {Des,S}erializationStream extend Closeable
Apr 13, 2017
fbe4216
[SPARK-20233][SQL] Apply star-join filter heuristics to dynamic progr…
ioana-delaney Apr 13, 2017
8ddf0d2
[SPARK-20232][PYTHON] Improve combineByKey docs
Apr 13, 2017
7536e28
[SPARK-20038][SQL] FileFormatWriter.ExecuteWriteTask.releaseResources…
steveloughran Apr 13, 2017
fb036c4
[SPARK-20318][SQL] Use Catalyst type for min/max in ColumnStat for ea…
Apr 14, 2017
98b41ec
[SPARK-20316][SQL] Val and Var should strictly follow the Scala syntax
Apr 15, 2017
35e5ae4
[SPARK-19716][SQL][FOLLOW-UP] UnresolvedMapObjects should always be s…
cloud-fan Apr 16, 2017
e090f3c
[SPARK-20335][SQL] Children expressions of Hive UDF impacts the deter…
gatorsmile Apr 16, 2017
a888fed
[SPARK-19740][MESOS] Add support in Spark to pass arbitrary parameter…
Apr 16, 2017
ad935f5
[SPARK-20343][BUILD] Add avro dependency in core POM to resolve build…
HyukjinKwon Apr 16, 2017
86d251c
[SPARK-20278][R] Disable 'multiple_dots_linter' lint rule that is aga…
HyukjinKwon Apr 16, 2017
24f09b3
[SPARK-19828][R][FOLLOWUP] Rename asJsonArray to as.json.array in fro…
HyukjinKwon Apr 17, 2017
01ff035
[SPARK-20349][SQL] ListFunctions returns duplicate functions after us…
gatorsmile Apr 17, 2017
e5fee3e
[SPARK-17647][SQL] Fix backslash escaping in 'LIKE' patterns.
jodersky Apr 17, 2017
0075562
Typo fix: distitrbuted -> distributed
ash211 Apr 18, 2017
33ea908
[TEST][MINOR] Replace repartitionBy with distribute in CollapseRepart…
jaceklaskowski Apr 18, 2017
b0a1e93
[SPARK-17647][SQL][FOLLOWUP][MINOR] fix typo
felixcheung Apr 18, 2017
07fd94e
[SPARK-20344][SCHEDULER] Duplicate call in FairSchedulableBuilder.add…
snazy Apr 18, 2017
d4f10cb
[SPARK-20343][BUILD] Force Avro 1.7.7 in sbt build to resolve build f…
HyukjinKwon Apr 18, 2017
321b4f0
[SPARK-20366][SQL] Fix recursive join reordering: inside joins are no…
Apr 18, 2017
1f81dda
[SPARK-20354][CORE][REST-API] When I request access to the 'http: //i…
Apr 18, 2017
f654b39
[SPARK-20360][PYTHON] reprs for interpreters
rgbkrk Apr 18, 2017
74aa0df
[SPARK-20377][SS] Fix JavaStructuredSessionization example
tdas Apr 18, 2017
e468a96
[SPARK-20254][SQL] Remove unnecessary data conversion for Dataset wit…
kiszk Apr 19, 2017
702d85a
[SPARK-20208][R][DOCS] Document R fpGrowth support
zero323 Apr 19, 2017
608bf30
[SPARK-20359][SQL] Avoid unnecessary execution in EliminateOuterJoin …
koertkuipers Apr 19, 2017
773754b
[SPARK-20356][SQL] Pruned InMemoryTableScanExec should have correct o…
viirya Apr 19, 2017
3537876
[SPARK-20343][BUILD] Avoid Unidoc build only if Hadoop 2.6 is explici…
HyukjinKwon Apr 19, 2017
71a8e9d
[SPARK-20036][DOC] Note incompatible dependencies on org.apache.kafka…
koeninger Apr 19, 2017
4fea784
[SPARK-20397][SPARKR][SS] Fix flaky test: test_streaming.R.Terminated…
zsxwing Apr 19, 2017
63824b2
[SPARK-20350] Add optimization rules to apply Complementation Laws.
ptkool Apr 20, 2017
39e303a
[MINOR][SS] Fix a missing space in UnsupportedOperationChecker error …
zsxwing Apr 20, 2017
dd6d55d
[SPARK-20398][SQL] range() operator should include cancellation reaso…
ericl Apr 20, 2017
bdc6056
Fixed typos in docs
Apr 20, 2017
46c5749
[SPARK-20375][R] R wrappers for array and map
zero323 Apr 20, 2017
55bea56
[SPARK-20156][SQL][FOLLOW-UP] Java String toLowerCase "Turkish locale…
gatorsmile Apr 20, 2017
c6f62c5
[SPARK-20405][SQL] Dataset.withNewExecutionId should be private
rxin Apr 20, 2017
b91873d
[SPARK-20409][SQL] fail early if aggregate function in GROUP BY
cloud-fan Apr 20, 2017
c5a31d1
[SPARK-20407][TESTS] ParquetQuerySuite 'Enabling/disabling ignoreCorr…
bogdanrdc Apr 20, 2017
b2ebadf
[SPARK-20358][CORE] Executors failing stage on interrupted exception …
ericl Apr 20, 2017
d95e4d9
[SPARK-20334][SQL] Return a better error message when correlated pred…
dilipbiswal Apr 20, 2017
0332063
[SPARK-20410][SQL] Make sparkConf a def in SharedSQLContext
hvanhovell Apr 20, 2017
592f5c8
[SPARK-20172][CORE] Add file permission check when listing files in F…
jerryshao Apr 20, 2017
0368eb9
[SPARK-20367] Properly unescape column names of partitioning columns …
juliuszsompolski Apr 21, 2017
760c8d0
[SPARK-20329][SQL] Make timezone aware expression without timezone un…
hvanhovell Apr 21, 2017
48d760d
[SPARK-20281][SQL] Print the identical Range parameters of SparkConte…
maropu Apr 21, 2017
e2b3d23
[SPARK-20420][SQL] Add events to the external catalog
hvanhovell Apr 21, 2017
3476799
Small rewording about history server use case
dud225 Apr 21, 2017
c9e6035
[SPARK-20412] Throw ParseException from visitNonOptionalPartitionSpec…
juliuszsompolski Apr 21, 2017
a750a59
[SPARK-20341][SQL] Support BigInt's value that does not fit in long v…
kiszk Apr 21, 2017
eb00378
[SPARK-20423][ML] fix MLOR coeffs centering when reg == 0
WeichenXu123 Apr 21, 2017
fd648bf
[SPARK-20371][R] Add wrappers for collect_list and collect_set
zero323 Apr 21, 2017
ad29040
[SPARK-20401][DOC] In the spark official configuration document, the …
Apr 21, 2017
05a4514
[SPARK-20386][SPARK CORE] modify the log info if the block exists on …
eatoncys Apr 22, 2017
b3c572a
[SPARK-20430][SQL] Initialise RangeExec parameters in a driver side
maropu Apr 22, 2017
8765bc1
[SPARK-20132][DOCS] Add documentation for column string functions
map222 Apr 23, 2017
2eaf4f3
[SPARK-20385][WEB-UI] Submitted Time' field, the date format needs to…
Apr 23, 2017
e9f9715
[BUILD] Close stale PRs
maropu Apr 24, 2017
776a2c0
[SPARK-20439][SQL] Fix Catalog API listTables and getTable when faile…
gatorsmile Apr 24, 2017
90264ac
[SPARK-18901][ML] Require in LR LogisticAggregator is redundant
wangmiao1981 Apr 24, 2017
8a272dd
[SPARK-20438][R] SparkR wrappers for split and repeat
zero323 Apr 24, 2017
5280d93
[SPARK-20239][CORE] Improve HistoryServer's ACL mechanism
jerryshao Apr 25, 2017
f44c8a8
[SPARK-20453] Bump master branch version to 2.3.0-SNAPSHOT
JoshRosen Apr 25, 2017
31345fd
[SPARK-20451] Filter out nested mapType datatypes from sort order in …
sameeragarwal Apr 25, 2017
c8f1219
[SPARK-20455][DOCS] Fix Broken Docker IT Docs
original-brownbear Apr 25, 2017
0bc7a90
[SPARK-20404][CORE] Using Option(name) instead of Some(name)
szhem Apr 25, 2017
387565c
[SPARK-18901][FOLLOWUP][ML] Require in LR LogisticAggregator is redun…
wangmiao1981 Apr 25, 2017
67eef47
[SPARK-20449][ML] Upgrade breeze version to 0.13.1
yanboliang Apr 25, 2017
0a7f5f2
[SPARK-5484][GRAPHX] Periodically do checkpoint in Pregel
Apr 25, 2017
caf3920
[SPARK-18127] Add hooks and extension points to Spark
sameeragarwal Apr 26, 2017
57e1da3
[SPARK-16548][SQL] Inconsistent error handling in JSON parsing SQL fu…
Apr 26, 2017
df58a95
[SPARK-20437][R] R wrappers for rollup and cube
zero323 Apr 26, 2017
7a36525
[SPARK-20400][DOCS] Remove References to 3rd Party Vendor Tools
Apr 26, 2017
7fecf51
[SPARK-19812] YARN shuffle service fails to relocate recovery DB acro…
tgravescs Apr 26, 2017
dbb06c6
[MINOR][ML] Fix some PySpark & SparkR flaky tests
yanboliang Apr 26, 2017
66dd5b8
[SPARK-20391][CORE] Rename memory related fields in ExecutorSummay
jerryshao Apr 26, 2017
99c6cf9
[SPARK-20473] Enabling missing types in ColumnVector.Array
michal-databricks Apr 26, 2017
a277ae8
[SPARK-20474] Fixing OnHeapColumnVector reallocation
michal-databricks Apr 26, 2017
2ba1eba
[SPARK-12868][SQL] Allow adding jars from hdfs
weiqingy Apr 26, 2017
66636ef
[SPARK-20435][CORE] More thorough redaction of sensitive information
markgrover Apr 27, 2017
b4724db
[SPARK-20425][SQL] Support a vertical display mode for Dataset.show
maropu Apr 27, 2017
b58cf77
[DOCS][MINOR] Add missing since to SparkR repeat_string note.
zero323 Apr 27, 2017
ba76662
[SPARK-20208][DOCS][FOLLOW-UP] Add FP-Growth to SparkR programming guide
zero323 Apr 27, 2017
7633933
[SPARK-20483] Mesos Coarse mode may starve other Mesos frameworks
dgshep Apr 27, 2017
561e9cc
[SPARK-20421][CORE] Mark internal listeners as deprecated.
Apr 27, 2017
85c6ce6
[SPARK-20426] Lazy initialization of FileSegmentManagedBuffer for shu…
Apr 27, 2017
26ac2ce
[SPARK-20482][SQL] Resolving Casts is too strict on having time zone set
rednaxelafx Apr 27, 2017
a4aa466
[SPARK-20487][SQL] `HiveTableScan` node is quite verbose in explained…
tejasapatil Apr 27, 2017
039e32c
[SPARK-20483][MINOR] Test for Mesos Coarse mode may starve other Meso…
dgshep Apr 27, 2017
606432a
[SPARK-20047][ML] Constrained Logistic Regression
yanboliang Apr 27, 2017
01c999e
[SPARK-20461][CORE][SS] Use UninterruptibleThread for Executor and fi…
zsxwing Apr 27, 2017
823baca
[SPARK-20452][SS][KAFKA] Fix a potential ConcurrentModificationExcept…
zsxwing Apr 27, 2017
b90bf52
[SPARK-12837][CORE] Do not send the name of internal accumulator to e…
cloud-fan Apr 28, 2017
7fe8249
[SPARKR][DOC] Document LinearSVC in R programming guide
wangmiao1981 Apr 28, 2017
e3c8160
[SPARK-20476][SQL] Block users to create a table that use commas in t…
gatorsmile Apr 28, 2017
59e3a56
[SPARK-14471][SQL] Aliases in SELECT could be used in GROUP BY
maropu Apr 28, 2017
8c911ad
[SPARK-20465][CORE] Throws a proper exception when any temp directory…
HyukjinKwon Apr 28, 2017
733b81b
[SPARK-20496][SS] Bug in KafkaWriter Looks at Unanalyzed Plans
Apr 28, 2017
5d71f3d
[SPARK-20514][CORE] Upgrade Jetty to 9.3.11.v20160721
markgrover Apr 28, 2017
ebff519
[SPARK-20471] Remove AggregateBenchmark testsuite warning: Two level …
heary-cao Apr 28, 2017
77bcd77
[SPARK-19525][CORE] Add RDD checkpoint compression support
Apr 28, 2017
814a61a
[SPARK-20487][SQL] Display `serde` for `HiveTableScan` node in explai…
tejasapatil Apr 29, 2017
b28c3bc
[SPARK-20477][SPARKR][DOC] Document R bisecting k-means in R programm…
wangmiao1981 Apr 29, 2017
add9d1b
[SPARK-19791][ML] Add doc and example for fpgrowth
YY-OnCall Apr 29, 2017
ee694cd
[SPARK-20533][SPARKR] SparkR Wrappers Model should be private and val…
wangmiao1981 Apr 29, 2017
70f1bcd
[SPARK-20493][R] De-duplicate parse logics for DDL-like type strings …
HyukjinKwon Apr 29, 2017
d228cd0
[SPARK-20442][PYTHON][DOCS] Fill up documentations for functions in C…
HyukjinKwon Apr 29, 2017
4d99b95
[SPARK-20521][DOC][CORE] The default of 'spark.worker.cleanup.appData…
Apr 30, 2017
1ee494d
[SPARK-20492][SQL] Do not print empty parentheses for invalid primiti…
HyukjinKwon Apr 30, 2017
ae3df4e
[SPARK-20535][SPARKR] R wrappers for explode_outer and posexplode_outer
zero323 Apr 30, 2017
6613046
[MINOR][DOCS][PYTHON] Adding missing boolean type for replacement val…
May 1, 2017
80e9cf1
[SPARK-20490][SPARKR] Add R wrappers for eqNullSafe and ! / not
zero323 May 1, 2017
a355b66
[SPARK-20541][SPARKR][SS] support awaitTermination without timeout
felixcheung May 1, 2017
f0169a1
[SPARK-20290][MINOR][PYTHON][SQL] Add PySpark wrapper for eqNullSafe
zero323 May 1, 2017
6b44c4d
[SPARK-20534][SQL] Make outer generate exec return empty rows
hvanhovell May 1, 2017
ab30590
[SPARK-20517][UI] Fix broken history UI download link
jerryshao May 1, 2017
6fc6cf8
[SPARK-20464][SS] Add a job group and description for streaming queri…
kunalkhamar May 1, 2017
2b2dd08
[SPARK-20540][CORE] Fix unstable executor requests.
rdblue May 1, 2017
af726cd
[SPARK-20459][SQL] JdbcUtils throws IllegalStateException: Cause alre…
srowen May 2, 2017
259860d
[SPARK-20463] Add support for IS [NOT] DISTINCT FROM.
ptkool May 2, 2017
943a684
[SPARK-20548] Disable ReplSuite.newProductSeqEncoder with REPL define…
sameeragarwal May 2, 2017
d20a976
[SPARK-20192][SPARKR][DOC] SparkR migration guide to 2.2.0
felixcheung May 2, 2017
90d77e9
[SPARK-20532][SPARKR] Implement grouping and grouping_id
zero323 May 2, 2017
afb21bf
[SPARK-20537][CORE] Fixing OffHeapColumnVector reallocation
kiszk May 2, 2017
86174ea
[SPARK-20549] java.io.CharConversionException: Invalid UTF-32' in Jso…
brkyvz May 2, 2017
e300a5a
[SPARK-20300][ML][PYSPARK] Python API for ALSModel.recommendForAllUse…
May 2, 2017
b1e639a
[SPARK-19235][SQL][TEST][FOLLOW-UP] Enable Test Cases in DDLSuite wit…
gatorsmile May 2, 2017
13f47dc
[SPARK-20490][SPARKR][DOC] add family tag for not function
felixcheung May 2, 2017
ef3df91
[SPARK-20421][CORE] Add a missing deprecation tag.
May 2, 2017
b946f31
[SPARK-20558][CORE] clear InheritableThreadLocal variables in SparkCo…
cloud-fan May 3, 2017
6235132
[SPARK-20567] Lazily bind in GenerateExec
marmbrus May 3, 2017
db2fb84
[SPARK-6227][MLLIB][PYSPARK] Implement PySpark wrappers for SVD and P…
MechCoder May 3, 2017
16fab6b
[SPARK-20523][BUILD] Clean up build warnings for 2.2.0 release
srowen May 3, 2017
7f96f2d
[SPARK-16957][MLLIB] Use midpoints for split values.
facaiy May 3, 2017
27f543b
[SPARK-20441][SPARK-20432][SS] Within the same streaming query, one S…
lw-lin May 3, 2017
527fc5d
[SPARK-20576][SQL] Support generic hint function in Dataset/DataFrame
rxin May 3, 2017
6b9e49d
[SPARK-19965][SS] DataFrame batch reader may fail to infer partitions…
lw-lin May 3, 2017
13eb37c
[MINOR][SQL] Fix the test title from =!= to <=>, remove a duplicated …
HyukjinKwon May 3, 2017
02bbe73
[SPARK-20584][PYSPARK][SQL] Python generic hint support
zero323 May 4, 2017
fc472bd
[SPARK-20543][SPARKR] skip tests when running on CRAN
felixcheung May 4, 2017
b8302cc
[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streami…
felixcheung May 4, 2017
9c36aa2
[SPARK-20585][SPARKR] R generic hint support
zero323 May 4, 2017
f21897f
[SPARK-20544][SPARKR] R wrapper for input_file_name
zero323 May 4, 2017
57b6470
[SPARK-20571][SPARKR][SS] Flaky Structured Streaming tests
felixcheung May 4, 2017
c5dceb8
[SPARK-20047][FOLLOWUP][ML] Constrained Logistic Regression follow up
yanboliang May 4, 2017
bfc8c79
[SPARK-20566][SQL] ColumnVector should support `appendFloats` for array
dongjoon-hyun May 4, 2017
0d16faa
[SPARK-20574][ML] Allow Bucketizer to handle non-Double numeric column
May 5, 2017
4411ac7
[INFRA] Close stale PRs
HyukjinKwon May 5, 2017
37cdf07
[SPARK-19660][SQL] Replace the deprecated property name fs.default.na…
wangyum May 5, 2017
5773ab1
[SPARK-20546][DEPLOY] spark-class gets syntax error in posix mode
jyu00 May 5, 2017
9064f1b
[SPARK-20495][SQL][CORE] Add StorageLevel to cacheTable API
phatak-dev May 5, 2017
b9ad2d1
[SPARK-20613] Remove excess quotes in Windows executable
jarrettmeyer May 5, 2017
41439fd
[SPARK-20381][SQL] Add SQL metrics of numOutputRows for ObjectHashAgg…
May 5, 2017
bd57882
[SPARK-20603][SS][TEST] Set default number of topic partitions to 1 t…
zsxwing May 5, 2017
b31648c
[SPARK-20557][SQL] Support for db column type TIMESTAMP WITH TIME ZONE
JannikArndt May 5, 2017
5d75b14
[SPARK-20616] RuleExecutor logDebug of batch results should show diff…
juliuszsompolski May 5, 2017
b433aca
[SPARK-20614][PROJECT INFRA] Use the same log4j configuration with Je…
HyukjinKwon May 6, 2017
cafca54
[SPARK-20557][SQL] Support JDBC data type Time with Time Zone
gatorsmile May 7, 2017
63d90e7
[SPARK-18777][PYTHON][SQL] Return UDF from udf.register
zero323 May 7, 2017
37f963a
[SPARK-20518][CORE] Supplement the new blockidsuite unit tests
heary-cao May 7, 2017
88e6d75
[SPARK-20484][MLLIB] Add documentation to ALS code
danielyli May 7, 2017
2cf83c4
[SPARK-7481][BUILD] Add spark-hadoop-cloud module to pull in object s…
steveloughran May 7, 2017
7087e01
[SPARK-20543][SPARKR][FOLLOWUP] Don't skip tests on AppVeyor
felixcheung May 7, 2017
500436b
[MINOR][SQL][DOCS] Improve unix_timestamp's scaladoc (and typo hunting)
jaceklaskowski May 7, 2017
1f73d35
[SPARK-20550][SPARKR] R wrapper for Dataset.alias
zero323 May 7, 2017
f53a820
[SPARK-16931][PYTHON][SQL] Add Python wrapper for bucketBy
zero323 May 8, 2017
2269155
[SPARK-12297][SQL] Hive compatibility for Parquet Timestamps
squito May 8, 2017
c24bdaa
[SPARK-20626][SPARKR] address date test warning with timezone on windows
felixcheung May 8, 2017
42cc6d1
[SPARK-20380][SQL] Unable to set/unset table comment property using A…
sujith71955 May 8, 2017
2fdaeb5
[SPARKR][DOC] fix typo in vignettes
May 8, 2017
0f820e2
[SPARK-20519][SQL][CORE] Modify to prevent some possible runtime exce…
10110346 May 8, 2017
1552665
[SPARK-19956][CORE] Optimize a location order of blocks with topology…
ConeyLiu May 8, 2017
58518d0
[SPARK-20596][ML][TEST] Consolidate and improve ALS recommendAll test…
May 8, 2017
aeb2ecc
[SPARK-20621][DEPLOY] Delete deprecated config parameter in 'spark-en…
ConeyLiu May 8, 2017
829cd7b
[SPARK-20605][CORE][YARN][MESOS] Deprecate not used AM and executor p…
jerryshao May 8, 2017
2abfee1
[SPARK-20661][SPARKR][TEST] SparkR tableNames() test fails
falaki May 8, 2017
b952b44
[SPARK-20661][SPARKR][TEST][FOLLOWUP] SparkR tableNames() test fails
felixcheung May 9, 2017
8079424
[SPARK-11968][MLLIB] Optimize MLLIB ALS recommendForAll
May 9, 2017
10b00ab
[SPARK-20587][ML] Improve performance of ML ALS recommendForAll
May 9, 2017
be53a78
[SPARK-20615][ML][TEST] SparseVector.argmax throws IndexOutOfBoundsEx…
May 9, 2017
b8733e0
[SPARK-20606][ML] ML 2.2 QA: Remove deprecated methods for ML
yanboliang May 9, 2017
0d00c76
[SPARK-20667][SQL][TESTS] Cleanup the cataloged metadata after comple…
gatorsmile May 9, 2017
714811d
[SPARK-20311][SQL] Support aliases for table value functions
maropu May 9, 2017
181261a
[SPARK-20355] Add per application spark version on the history server…
May 9, 2017
f561a76
[SPARK-20548][FLAKY-TEST] share one REPL instance among REPL test cases
cloud-fan May 9, 2017
d099f41
[SPARK-20674][SQL] Support registering UserDefinedFunction as named UDF
rxin May 9, 2017
25ee816
[SPARK-19876][BUILD] Move Trigger.java to java source hierarchy
srowen May 9, 2017
1b85bcd
[SPARK-20627][PYSPARK] Drop the hadoop distirbution name from the Pyt…
holdenk May 9, 2017
ac1ab6b
Revert "[SPARK-12297][SQL] Hive compatibility for Parquet Timestamps"
rxin May 9, 2017
f79aa28
Revert "[SPARK-20311][SQL] Support aliases for table value functions"
yhuai May 9, 2017
c0189ab
[SPARK-20373][SQL][SS] Batch queries with 'Dataset/DataFrame.withWate…
uncleGen May 9, 2017
771abeb
[SPARK-17685][SQL] Make SortMergeJoinExec's currentVars is null when …
wangyum May 10, 2017
3d2131a
[SPARK-20590][SQL] Use Spark internal datasource if multiples are fou…
HyukjinKwon May 10, 2017
a90c5cd
[SPARK-20686][SQL] PropagateEmptyRelation incorrectly handles aggrega…
JoshRosen May 10, 2017
a819dab
[SPARK-20670][ML] Simplify FPGrowth transform
YY-OnCall May 10, 2017
0ef16bd
[SPARK-20668][SQL] Modify ScalaUDF to handle nullability.
ueshin May 10, 2017
804949c
[SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsisten…
zero323 May 10, 2017
ca4625e
[SPARK-20630][WEB UI] Fixed column visibility in Executor Tab
ajbozarth May 10, 2017
a4cbf26
[SPARK-20637][CORE] Remove mention of old RDD classes from comments
michaelmior May 10, 2017
b512233
[SPARK-20393][WEBU UI] Strengthen Spark to prevent XSS vulnerabilities
n-marion May 10, 2017
789bdbe
[SPARK-20688][SQL] correctly check analysis for scalar sub-queries
cloud-fan May 10, 2017
76e4a55
[SPARK-20678][SQL] Ndv for columns not in filter condition should als…
May 10, 2017
fcb88f9
[MINOR][BUILD] Fix lint-java breaks.
ConeyLiu May 10, 2017
5c2c4dc
[SPARK-19447] Remove remaining references to generated rows metric
ala May 10, 2017
af8b6cc
[SPARK-20689][PYSPARK] python doctest leaking bucketed table
felixcheung May 10, 2017
8ddbc43
[SPARK-20685] Fix BatchPythonEvaluation bug in case of single UDF w/ …
JoshRosen May 10, 2017
0698e6c
[SPARK-20606][ML] Revert "[] ML 2.2 QA: Remove deprecated methods for…
yanboliang May 11, 2017
65accb8
[SPARK-17029] make toJSON not go through rdd form but operate on data…
May 11, 2017
b4c99f4
[SPARK-20569][SQL] RuntimeReplaceable functions should not take extra…
cloud-fan May 11, 2017
8c67aa7
[SPARK-20311][SQL] Support aliases for table value functions
maropu May 11, 2017
3aa4e46
[SPARK-20416][SQL] Print UDF names in EXPLAIN
maropu May 11, 2017
7144b51
[SPARK-20600][SS] KafkaRelation should be pretty printed in web UI
jaceklaskowski May 11, 2017
04901dd
[SPARK-20431][SQL] Specify a schema by using a DDL-formatted string
maropu May 11, 2017
609ba5f
[SPARK-20399][SQL] Add a config to fallback string literal parsing co…
viirya May 12, 2017
2b36eb6
[SPARK-20665][SQL] Bround" and "Round" function return NULL
10110346 May 12, 2017
c8da535
[SPARK-20718][SQL] FileSourceScanExec with different filter orders sh…
May 12, 2017
888b84a
[SPARK-20704][SPARKR] change CRAN test to run single thread
felixcheung May 12, 2017
af40bb1
[SPARK-20619][ML] StringIndexer supports multiple ways to order label
May 12, 2017
720708c
[SPARK-20639][SQL] Add single argument support for to_timestamp in SQ…
HyukjinKwon May 12, 2017
fc8a2b6
[SPARK-20554][BUILD] Remove usage of scala.language.reflectiveCalls
srowen May 12, 2017
b236933
[SPARK-17424] Fix unsound substitution bug in ScalaReflection.
rdblue May 12, 2017
54b4f2a
[SPARK-20718][SQL][FOLLOWUP] Fix canonicalization for HiveTableScanExec
May 12, 2017
92ea7fd
[SPARK-20710][SQL] Support aliases in CUBE/ROLLUP/GROUPING SETS
maropu May 12, 2017
b526f70
[SPARK-19951][SQL] Add string concatenate operator || to Spark SQL
maropu May 12, 2017
7d6ff39
[SPARK-20702][CORE] TaskContextImpl.markTaskCompleted should not hide…
zsxwing May 12, 2017
0d3a631
[SPARK-20714][SS] Fix match error when watermark is set with timeout …
tdas May 12, 2017
e3d2022
[SPARK-20594][SQL] The staging directory should be a child directory …
May 12, 2017
b84ff7e
[SPARK-20719][SQL] Support LIMIT ALL
gatorsmile May 12, 2017
3f98375
[SPARK-18772][SQL] Avoid unnecessary conversion try for special float…
HyukjinKwon May 13, 2017
c2c1c5b
respect both gpu and maxgpu
Mar 10, 2017
c5c5c37
Merge branch 'ji/hard_limit_on_gpu' of https://github.com/yanji84/spa…
May 13, 2017
ba87b35
fix syntax
May 13, 2017
5ef2881
fix gpu offer
May 14, 2017
c301f3d
syntax fix
May 14, 2017
7a07742
pass all tests
May 15, 2017
[SPARK-7481][BUILD] Add spark-hadoop-cloud module to pull in object store access.

## What changes were proposed in this pull request?

Add a new `spark-hadoop-cloud` module and maven profile to pull in object store support from `hadoop-openstack`, `hadoop-aws` and `hadoop-azure` (Hadoop 2.7+) JARs, along with their dependencies, fixing up the dependencies so that everything works, in particular Jackson.

It restores `s3n://` access to S3, adds its `s3a://` replacement, OpenStack `swift://`, and Azure `wasb://`.

There's a documentation page, `cloud_integration.md`, which covers the basic details of using Spark with object stores, referring the reader to the supplier's own documentation, with specific warnings on security and the possible mismatch between a store's behavior and that of a filesystem. In particular, users are advised to be very cautious when trying to use an object store as the destination of data, and to consult the documentation of the storage supplier and the connector.

(this is the successor to #12004; I can't re-open it)

## How was this patch tested?

Downstream tests exist in [https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples](https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples)

Those verify that the dependencies are sufficient to allow downstream applications to work with s3a, azure wasb and swift storage connectors, and perform basic IO & dataframe operations thereon. All seems well.

Manually clean build & verify that assembly contains the relevant aws-* hadoop-* artifacts on Hadoop 2.6; azure on a hadoop-2.7 profile.

SBT build: `build/sbt -Phadoop-cloud -Phadoop-2.7 package`
Maven build: `mvn install -Phadoop-cloud -Phadoop-2.7`

This PR *does not* update `dev/deps/spark-deps-hadoop-2.7` or `dev/deps/spark-deps-hadoop-2.6`, because unless the hadoop-cloud profile is enabled, no extra JARs show up in the dependency list. The dependency check in Jenkins isn't setting the property, so the new JARs aren't visible.

Author: Steve Loughran <[email protected]>
Author: Steve Loughran <[email protected]>

Closes #17834 from steveloughran/cloud/SPARK-7481-current.
steveloughran authored and srowen committed May 7, 2017
commit 2cf83c47838115f71419ba5b9296c69ec1d746cd
14 changes: 14 additions & 0 deletions assembly/pom.xml
@@ -226,5 +226,19 @@
<parquet.deps.scope>provided</parquet.deps.scope>
</properties>
</profile>

<!--
Pull in spark-hadoop-cloud and its associated JARs,
-->
<profile>
<id>hadoop-cloud</id>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hadoop-cloud_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
</dependencies>
</profile>
</profiles>
</project>
200 changes: 200 additions & 0 deletions docs/cloud-integration.md
@@ -0,0 +1,200 @@
---
layout: global
displayTitle: Integration with Cloud Infrastructures
title: Integration with Cloud Infrastructures
description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
---
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

* This will become a table of contents (this text will be scraped).
{:toc}

## Introduction


All major cloud providers offer persistent data storage in *object stores*.
These are not classic "POSIX" file systems.
In order to store hundreds of petabytes of data without any single points of failure,
object stores replace the classic filesystem directory tree
with a simpler model of `object-name => data`. To enable remote access, operations
on objects are usually offered as (slow) HTTP REST operations.

Spark can read and write data in object stores through filesystem connectors implemented
in Hadoop or provided by the infrastructure suppliers themselves.
These connectors make the object stores look *almost* like filesystems, with directories and files
and the classic operations on them such as list, delete and rename.


### Important: Cloud Object Stores are Not Real Filesystems

While the stores appear to be filesystems, underneath
they are still object stores, [and the difference is significant](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/introduction.html).

They cannot be used as a direct replacement for a cluster filesystem such as HDFS
*except where this is explicitly stated*.

Key differences are:

* Changes to stored objects may not be immediately visible, both in directory listings and actual data access.
* The means by which directories are emulated may make working with them slow.
* Rename operations may be very slow and, on failure, leave the store in an unknown state.
* Seeking within a file may require new HTTP calls, hurting performance.

How does this affect Spark?

1. Reading and writing data can be significantly slower than working with a normal filesystem.
1. Some directory structures may be very inefficient to scan during query split calculation.
1. The output of work may not be immediately visible to a follow-on query.
1. The rename-based algorithm by which Spark normally commits work when saving an RDD, DataFrame or Dataset
is potentially both slow and unreliable.

For these reasons, it is not always safe to use an object store as a direct destination of queries, or as
an intermediate store in a chain of queries. Consult the documentation of the object store and its
connector to determine which uses are considered safe.

In particular: *without some form of consistency layer, Amazon S3 cannot
be safely used as the direct destination of work with the normal rename-based committer.*

### Installation

With the relevant libraries on the classpath and Spark configured with valid credentials,
objects can be read or written by using their URLs as the path to data.
For example `sparkContext.textFile("s3a://landsat-pds/scene_list.gz")` will create
an RDD of the file `scene_list.gz` stored in S3, using the s3a connector.
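
A minimal sketch of such a read, assuming a `spark-shell` session where `sc` is the `SparkContext`, the relevant JARs are on the classpath, and credentials are configured as described under "Authenticating" below:

{% highlight scala %}
// Count the lines of a public object read through the s3a connector.
// The bucket and object are the public landsat-pds example used above.
val sceneList = sc.textFile("s3a://landsat-pds/scene_list.gz")
println(s"scene_list.gz contains ${sceneList.count()} lines")
{% endhighlight %}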

To add the relevant libraries to an application's classpath, include the `hadoop-cloud`
module and its dependencies.

In Maven, add the following to the `pom.xml` file, assuming `spark.version`
is set to the chosen version of Spark:

{% highlight xml %}
<dependencyManagement>
...
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>hadoop-cloud_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
...
</dependencyManagement>
{% endhighlight %}

Commercial products based on Apache Spark generally directly set up the classpath
for talking to cloud infrastructures, in which case this module may not be needed.

### Authenticating

Spark jobs must authenticate with the object stores to access data within them.

1. When Spark is running in a cloud infrastructure, the credentials are usually automatically set up.
1. `spark-submit` reads the `AWS_ACCESS_KEY`, `AWS_SECRET_KEY`
and `AWS_SESSION_TOKEN` environment variables and sets the associated authentication options
for the `s3n` and `s3a` connectors to Amazon S3.
1. In a Hadoop cluster, settings may be set in the `core-site.xml` file.
1. Authentication details may be manually added to the Spark configuration in `spark-defaults.conf`.
1. Alternatively, they can be programmatically set in the `SparkConf` instance used to configure
the application's `SparkContext`, as in the sketch below.
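
The following is a minimal sketch of that last option. The `fs.s3a.*` property names come from the Hadoop S3A documentation rather than this page, and reading them from environment variables is only for illustration; any Hadoop option can be forwarded by prefixing it with `spark.hadoop.`.

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: supply S3A credentials programmatically. The values are read from
// environment variables here purely for illustration; never hard-code secrets.
val conf = new SparkConf()
  .setAppName("ObjectStoreExample")
  .set("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY"))
  .set("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_KEY"))
val sc = new SparkContext(conf)
{% endhighlight %}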

*Important: never check authentication secrets into source code repositories,
especially public ones*

Consult [the Hadoop documentation](https://hadoop.apache.org/docs/current/) for the relevant
configuration and security options.

## Configuring

Each cloud connector has its own set of configuration parameters; again,
consult the relevant documentation.

### Recommended settings for writing to object stores

For object stores whose consistency model means that rename-based commits are safe,
use the `FileOutputCommitter` v2 algorithm for performance:

```
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
```

This does less renaming at the end of a job than the "version 1" algorithm.
As it still uses `rename()` to commit files, it is unsafe to use
when the object store does not have consistent metadata/listings.

The committer can also be set to ignore failures when cleaning up temporary
files; this reduces the risk that a transient network problem is escalated into a
job failure:

```
spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
```

Storing temporary files can run up charges; delete
directories called `"_temporary"` on a regular basis to avoid this.

### Parquet I/O Settings

For optimal performance when working with Parquet data, use the following settings:

```
spark.hadoop.parquet.enable.summary-metadata false
spark.sql.parquet.mergeSchema false
spark.sql.parquet.filterPushdown true
spark.sql.hive.metastorePartitionPruning true
```

These minimise the amount of data read during queries.
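
As a sketch, the same settings can also be applied when constructing a `SparkSession` instead of being placed in `spark-defaults.conf` (the application name here is just a placeholder):

{% highlight scala %}
import org.apache.spark.sql.SparkSession

// Sketch: the Parquet-related settings above, applied at session construction.
// "spark.hadoop."-prefixed keys are forwarded to the Hadoop configuration.
val spark = SparkSession.builder()
  .appName("ParquetOnObjectStore")
  .config("spark.hadoop.parquet.enable.summary-metadata", "false")
  .config("spark.sql.parquet.mergeSchema", "false")
  .config("spark.sql.parquet.filterPushdown", "true")
  .config("spark.sql.hive.metastorePartitionPruning", "true")
  .getOrCreate()
{% endhighlight %}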

### ORC I/O Settings

For best performance when working with ORC data, use these settings:

```
spark.sql.orc.filterPushdown true
spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000
spark.sql.hive.metastorePartitionPruning true
```

Again, these minimise the amount of data read during queries.

## Spark Streaming and Object Storage

Spark Streaming can monitor files added to object stores by creating a
`FileInputDStream` that monitors a path in the store through a call to
`StreamingContext.textFileStream()`.
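
A minimal sketch of such a stream, where the bucket, path and 30-second batch interval are placeholders and `sc` is an existing `SparkContext`:

{% highlight scala %}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: watch an object store directory and count the lines of newly
// arriving files in each 30-second batch.
val ssc = new StreamingContext(sc, Seconds(30))
val lines = ssc.textFileStream("s3a://my-bucket/incoming/")
lines.count().print()
ssc.start()
ssc.awaitTermination()
{% endhighlight %}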

1. The time to scan for new files is proportional to the number of files
under the path, not the number of *new* files, so it can become a slow operation.
The size of the window needs to be set to handle this.

1. Files only appear in an object store once they are completely written; there
is no need for a workflow of write-then-rename to ensure that files aren't picked up
while they are still being written. Applications can write straight to the monitored directory.

1. Streams should only be checkpointed to a store implementing a fast and
atomic `rename()` operation. Otherwise the checkpointing may be slow and potentially unreliable.

## Further Reading

Here is the documentation on the standard connectors, both from Apache and from the cloud providers.

* [OpenStack Swift](https://hadoop.apache.org/docs/current/hadoop-openstack/index.html). Hadoop 2.6+
* [Azure Blob Storage](https://hadoop.apache.org/docs/current/hadoop-azure/index.html). Since Hadoop 2.7
* [Azure Data Lake](https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html). Since Hadoop 2.8
* [Amazon S3 via S3A and S3N](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html). Hadoop 2.6+
* [Amazon EMR File System (EMRFS)](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html). From Amazon
* [Google Cloud Storage Connector for Spark and Hadoop](https://cloud.google.com/hadoop/google-cloud-storage-connector). From Google


1 change: 1 addition & 0 deletions docs/index.md
@@ -126,6 +126,7 @@ options for deployment:
* [Security](security.html): Spark security support
* [Hardware Provisioning](hardware-provisioning.html): recommendations for cluster hardware
* Integration with other storage systems:
* [Cloud Infrastructures](cloud-integration.html)
* [OpenStack Swift](storage-openstack-swift.html)
* [Building Spark](building-spark.html): build Spark using the Maven system
* [Contributing to Spark](http://spark.apache.org/contributing.html)
6 changes: 3 additions & 3 deletions docs/rdd-programming-guide.md
@@ -323,7 +323,7 @@ One important parameter for parallel collections is the number of *partitions* t

Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, [Amazon S3](http://wiki.apache.org/hadoop/AmazonS3), etc. Spark supports text files, [SequenceFiles](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html), and any other Hadoop [InputFormat](http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/InputFormat.html).

Text file RDDs can be created using `SparkContext`'s `textFile` method. This method takes an URI for the file (either a local path on the machine, or a `hdfs://`, `s3n://`, etc URI) and reads it as a collection of lines. Here is an example invocation:
Text file RDDs can be created using `SparkContext`'s `textFile` method. This method takes an URI for the file (either a local path on the machine, or a `hdfs://`, `s3a://`, etc URI) and reads it as a collection of lines. Here is an example invocation:

{% highlight scala %}
scala> val distFile = sc.textFile("data.txt")
@@ -356,7 +356,7 @@ Apart from text files, Spark's Scala API also supports several other data format

Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, [Amazon S3](http://wiki.apache.org/hadoop/AmazonS3), etc. Spark supports text files, [SequenceFiles](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html), and any other Hadoop [InputFormat](http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/InputFormat.html).

Text file RDDs can be created using `SparkContext`'s `textFile` method. This method takes an URI for the file (either a local path on the machine, or a `hdfs://`, `s3n://`, etc URI) and reads it as a collection of lines. Here is an example invocation:
Text file RDDs can be created using `SparkContext`'s `textFile` method. This method takes an URI for the file (either a local path on the machine, or a `hdfs://`, `s3a://`, etc URI) and reads it as a collection of lines. Here is an example invocation:

{% highlight java %}
JavaRDD<String> distFile = sc.textFile("data.txt");
@@ -388,7 +388,7 @@ Apart from text files, Spark's Java API also supports several other data formats

PySpark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, [Amazon S3](http://wiki.apache.org/hadoop/AmazonS3), etc. Spark supports text files, [SequenceFiles](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html), and any other Hadoop [InputFormat](http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/InputFormat.html).

Text file RDDs can be created using `SparkContext`'s `textFile` method. This method takes an URI for the file (either a local path on the machine, or a `hdfs://`, `s3n://`, etc URI) and reads it as a collection of lines. Here is an example invocation:
Text file RDDs can be created using `SparkContext`'s `textFile` method. This method takes an URI for the file (either a local path on the machine, or a `hdfs://`, `s3a://`, etc URI) and reads it as a collection of lines. Here is an example invocation:

{% highlight python %}
>>> distFile = sc.textFile("data.txt")
38 changes: 12 additions & 26 deletions docs/storage-openstack-swift.md
@@ -8,7 +8,8 @@ same URI formats as in Hadoop. You can specify a path in Swift as input through
URI of the form <code>swift://container.PROVIDER/path</code>. You will also need to set your
Swift security credentials, through <code>core-site.xml</code> or via
<code>SparkContext.hadoopConfiguration</code>.
Current Swift driver requires Swift to use Keystone authentication method.
The current Swift driver requires Swift to use the Keystone authentication method, or
its Rackspace-specific predecessor.

# Configuring Swift for Better Data Locality

@@ -19,41 +20,30 @@ Although not mandatory, it is recommended to configure the proxy server of Swift

# Dependencies

The Spark application should include <code>hadoop-openstack</code> dependency.
The Spark application should include the <code>hadoop-openstack</code> dependency, which can
be done by including the `hadoop-cloud` module for the specific version of Spark used.
For example, for Maven support, add the following to the <code>pom.xml</code> file:

{% highlight xml %}
<dependencyManagement>
...
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-openstack</artifactId>
<version>2.3.0</version>
<groupId>org.apache.spark</groupId>
<artifactId>hadoop-cloud_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
...
</dependencyManagement>
{% endhighlight %}


# Configuration Parameters

Create <code>core-site.xml</code> and place it inside Spark's <code>conf</code> directory.
There are two main categories of parameters that should to be configured: declaration of the
Swift driver and the parameters that are required by Keystone.
The main category of parameters that should be configured are the authentication parameters
required by Keystone.

Configuration of Hadoop to use Swift File system achieved via

<table class="table">
<tr><th>Property Name</th><th>Value</th></tr>
<tr>
<td>fs.swift.impl</td>
<td>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</td>
</tr>
</table>

Additional parameters required by Keystone (v2.0) and should be provided to the Swift driver. Those
parameters will be used to perform authentication in Keystone to access Swift. The following table
contains a list of Keystone mandatory parameters. <code>PROVIDER</code> can be any name.
The following table contains a list of Keystone mandatory parameters. <code>PROVIDER</code> can be
any (alphanumeric) name.

<table class="table">
<tr><th>Property Name</th><th>Meaning</th><th>Required</th></tr>
@@ -94,7 +84,7 @@ contains a list of Keystone mandatory parameters. <code>PROVIDER</code> can be a
</tr>
<tr>
<td><code>fs.swift.service.PROVIDER.public</code></td>
<td>Indicates if all URLs are public</td>
<td>Indicates whether to use the public (off cloud) or private (in cloud; no transfer fees) endpoints</td>
<td>Mandatory</td>
</tr>
</table>
@@ -104,10 +94,6 @@ defined for tenant <code>test</code>. Then <code>core-site.xml</code> should inc

{% highlight xml %}
<configuration>
<property>
<name>fs.swift.impl</name>
<value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
</property>
<property>
<name>fs.swift.service.SparkTest.auth.url</name>
<value>http://127.0.0.1:5000/v2.0/tokens</value>