Changes from 1 commit
774 commits
b30a7d2
[SPARK-23572][DOCS] Bring "security.md" up to date.
Mar 26, 2018
3e778f5
[SPARK-23162][PYSPARK][ML] Add r2adj into Python API in LinearRegress…
kevinyu98 Mar 26, 2018
35997b5
[SPARK-23794][SQL] Make UUID as stateful expression
viirya Mar 27, 2018
c68ec4e
[SPARK-23096][SS] Migrate rate source to V2
jerryshao Mar 27, 2018
ed72bad
[SPARK-23699][PYTHON][SQL] Raise same type of error caught with Arrow…
BryanCutler Mar 28, 2018
34c4b9c
[SPARK-23765][SQL] Supports custom line separator for json datasource
HyukjinKwon Mar 28, 2018
761565a
Revert "[SPARK-23096][SS] Migrate rate source to V2"
gatorsmile Mar 28, 2018
ea2fdc0
[SPARK-23675][WEB-UI] Title add spark logo, use spark logo image
Mar 29, 2018
641aec6
[SPARK-23806] Broadcast.unpersist can cause fatal exception when used…
Mar 29, 2018
505480c
[SPARK-23770][R] Exposes repartitionByRange in SparkR
HyukjinKwon Mar 29, 2018
491ec11
[SPARK-23785][LAUNCHER] LauncherBackend doesn't check state of connec…
Mar 29, 2018
a7755fd
[SPARK-23639][SQL] Obtain token before init metastore client in Spark…
yaooqinn Mar 29, 2018
b348901
[SPARK-23808][SQL] Set default Spark session in test-only spark sessi…
jose-torres Mar 30, 2018
df05fb6
[SPARK-23743][SQL] Changed a comparison logic from containing 'slf4j'…
jongyoul Mar 30, 2018
b02e76c
[SPARK-23727][SQL] Support for pushing down filters for DateType in p…
yucai Mar 30, 2018
5b5a36e
Roll forward "[SPARK-23096][SS] Migrate rate source to V2"
jose-torres Mar 30, 2018
bc8d093
[SPARK-23500][SQL][FOLLOWUP] Fix complex type simplification rules to…
gatorsmile Mar 30, 2018
ae91720
[SPARK-23640][CORE] Fix hadoop config may override spark config
wangyum Mar 30, 2018
15298b9
[SPARK-23827][SS] StreamingJoinExec should ensure that input data is …
tdas Mar 30, 2018
529f847
[SPARK-23040][CORE][FOLLOW-UP] Avoid double wrap result Iterator.
jiangxb1987 Mar 31, 2018
44a9f8e
[SPARK-15009][PYTHON][FOLLOWUP] Add default param checks for CountVec…
BryanCutler Apr 2, 2018
6151f29
[SPARK-23825][K8S] Requesting memory + memory overhead for pod memory
dvogelbacher Apr 2, 2018
fe2b7a4
[SPARK-23285][K8S] Add a config property for specifying physical exec…
liyinan926 Apr 2, 2018
a7c19d9
[SPARK-23713][SQL] Cleanup UnsafeWriter and BufferHolder classes
kiszk Apr 2, 2018
28ea4e3
[SPARK-23834][TEST] Wait for connection before disconnect in Launcher…
Apr 2, 2018
a135182
[SPARK-23690][ML] Add handleinvalid to VectorAssembler
Apr 2, 2018
441d0d0
[SPARK-19964][CORE] Avoid reading from remote repos in SparkSubmitSuite.
Apr 3, 2018
8020f66
[MINOR][DOC] Fix a few markdown typos
Apr 3, 2018
7cf9fab
[MINOR][CORE] Show block manager id when remove RDD/Broadcast fails.
jiangxb1987 Apr 3, 2018
66a3a5a
[SPARK-23099][SS] Migrate foreach sink to DataSourceV2
jose-torres Apr 3, 2018
1035aaa
[SPARK-23587][SQL] Add interpreted execution for MapObjects expression
viirya Apr 3, 2018
359375e
[SPARK-23809][SQL] Active SparkSession should be set by getOrCreate
ericl Apr 4, 2018
5cfd5fa
[SPARK-23802][SQL] PropagateEmptyRelation can leave query plan in unr…
Apr 4, 2018
16ef6ba
[SPARK-23826][TEST] TestHiveSparkSession should set default session
gatorsmile Apr 4, 2018
5197562
[SPARK-21351][SQL] Update nullability based on children's output
maropu Apr 4, 2018
a355236
[SPARK-23583][SQL] Invoke should support interpreted execution
kiszk Apr 4, 2018
cccaaa1
[SPARK-23668][K8S] Add config option for passing through k8s Pod.spec…
Apr 4, 2018
d8379e5
[SPARK-23838][WEBUI] Running SQL query is displayed as "completed" in…
gengliangwang Apr 4, 2018
d3bd043
[SPARK-23637][YARN] Yarn might allocate more resource if a same execu…
Apr 4, 2018
c5c8b54
[SPARK-23593][SQL] Add interpreted execution for InitializeJavaBean e…
viirya Apr 5, 2018
1822ecd
[SPARK-23582][SQL] StaticInvoke should support interpreted execution
kiszk Apr 5, 2018
b2329fb
Revert "[SPARK-23593][SQL] Add interpreted execution for InitializeJa…
hvanhovell Apr 5, 2018
d9ca1c9
[SPARK-23593][SQL] Add interpreted execution for InitializeJavaBean e…
viirya Apr 5, 2018
4807d38
[SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks to choose se…
kiszk Apr 6, 2018
f2ac087
[SPARK-23870][ML] Forward RFormula handleInvalid Param to VectorAssem…
Apr 6, 2018
d65e531
[SPARK-23823][SQL] Keep origin in transformExpression
Apr 6, 2018
249007e
[SPARK-19724][SQL] create a managed table with an existed default tab…
gengliangwang Apr 6, 2018
6ade5cb
[MINOR][DOC] Fix some typos and grammar issues
dsakuma Apr 6, 2018
9452401
[SPARK-23822][SQL] Improve error message for Parquet schema mismatches
yuchenhuo Apr 6, 2018
d766ea2
[SPARK-23861][SQL][DOC] Clarify default window frame with and without…
icexelloss Apr 6, 2018
c926acf
[SPARK-23882][CORE] UTF8StringSuite.writeToOutputStreamUnderflow() is…
kiszk Apr 6, 2018
d23a805
[SPARK-23859][ML] Initial PR for Instrumentation improvements: UUID a…
MrBago Apr 6, 2018
b6935ff
[SPARK-10399][SPARK-23879][HOTFIX] Fix Java lint errors
kiszk Apr 6, 2018
e998250
[SPARK-23828][ML][PYTHON] PySpark StringIndexerModel should have cons…
huaxingao Apr 6, 2018
6ab134c
[SPARK-21898][ML][FOLLOWUP] Fix Scala 2.12 build.
ueshin Apr 6, 2018
2c1fe64
[SPARK-23847][PYTHON][SQL] Add asc_nulls_first, asc_nulls_last to PyS…
huaxingao Apr 8, 2018
6a73457
[SPARK-23849][SQL] Tests for the samplingRatio option of JSON datasource
MaxGekk Apr 8, 2018
710a68c
[SPARK-23892][TEST] Improve converge and fix lint error in UTF8String…
kiszk Apr 8, 2018
8d40a79
[SPARK-23893][CORE][SQL] Avoid possible integer overflow in multiplic…
kiszk Apr 8, 2018
32471ba
Fix typo in Python docstring kinesis example
Apr 9, 2018
d81f29e
[SPARK-23881][CORE][TEST] Fix flaky test JobCancellationSuite."interr…
jiangxb1987 Apr 9, 2018
10f45bb
[SPARK-23816][CORE] Killed tasks should ignore FetchFailures.
squito Apr 9, 2018
7c1654e
[SPARK-22856][SQL] Add wrappers for codegen output and nullability
viirya Apr 9, 2018
252468a
[SPARK-14681][ML] Provide label/impurity stats for spark.ml decision …
WeichenXu123 Apr 9, 2018
61b7247
[INFRA] Close stale PRs.
Apr 9, 2018
f94f362
[SPARK-23947][SQL] Add hashUTF8String convenience method to hasher cl…
rednaxelafx Apr 10, 2018
6498884
[SPARK-23898][SQL] Simplify add & subtract code generation
hvanhovell Apr 10, 2018
95034af
[SPARK-23841][ML] NodeIdCache should unpersist the last cached nodeId…
zhengruifeng Apr 10, 2018
3323b15
[SPARK-23864][SQL] Add unsafe object writing to UnsafeWriter
hvanhovell Apr 10, 2018
e179658
[SPARK-19724][SQL][FOLLOW-UP] Check location of managed table when ig…
gengliangwang Apr 10, 2018
adb222b
[SPARK-23751][ML][PYSPARK] Kolmogorov-Smirnoff test Python API in pys…
WeichenXu123 Apr 10, 2018
4f1e8b9
[SPARK-23871][ML][PYTHON] add python api for VectorAssembler handleIn…
huaxingao Apr 10, 2018
7c7570d
[SPARK-23944][ML] Add the set method for the two LSHModel
lu-wang-dl Apr 11, 2018
c7622be
[SPARK-23847][FOLLOWUP][PYTHON][SQL] Actually test [desc|acs]_nulls_[…
HyukjinKwon Apr 11, 2018
87611bb
[MINOR][DOCS] Fix R documentation generation instruction for roxygen2
HyukjinKwon Apr 11, 2018
c604d65
[SPARK-23951][SQL] Use actual java class instead of string representa…
hvanhovell Apr 11, 2018
271c891
[SPARK-23960][SQL][MINOR] Mark HashAggregateExec.bufVars as transient
rednaxelafx Apr 11, 2018
653fe02
[SPARK-6951][CORE] Speed up parsing of event logs during listing.
Apr 11, 2018
3cb8204
[SPARK-22941][CORE] Do not exit JVM when submit fails with in-process…
Apr 11, 2018
75a1830
[SPARK-22883] ML test for StructuredStreaming: spark.ml.feature, I-M
jkbradley Apr 11, 2018
9d960de
typo rawPredicition changed to rawPrediction
JBauerKogentix Apr 11, 2018
e904dfa
Revert "[SPARK-23960][SQL][MINOR] Mark HashAggregateExec.bufVars as t…
gatorsmile Apr 12, 2018
6a2289e
[SPARK-23962][SQL][TEST] Fix race in currentExecutionIds().
squito Apr 12, 2018
0b19122
[SPARK-23762][SQL] UTF8StringBuffer uses MemoryBlock
kiszk Apr 12, 2018
0f93b91
[SPARK-23751][FOLLOW-UP] fix build for scala-2.12
WeichenXu123 Apr 12, 2018
682002b
[SPARK-23867][SCHEDULER] use droppedCount in logWarning
Apr 13, 2018
14291b0
[SPARK-23748][SS] Fix SS continuous process doesn't support SubqueryA…
jerryshao Apr 13, 2018
ab7b961
[SPARK-23942][PYTHON][SQL] Makes collect in PySpark as action for a q…
HyukjinKwon Apr 13, 2018
1018be4
[SPARK-23971] Should not leak Spark sessions across test suites
ericl Apr 13, 2018
4b07036
[SPARK-23815][CORE] Spark writer dynamic partition overwrite mode may…
Apr 13, 2018
0323e61
[SPARK-23905][SQL] Add UDF weekday
yucai Apr 13, 2018
a83ae0d
[SPARK-22839][K8S] Refactor to unify driver and executor pod builder …
mccheah Apr 13, 2018
4dfd746
[SPARK-23896][SQL] Improve PartitioningAwareFileIndex
gengliangwang Apr 13, 2018
25892f3
[SPARK-23375][SQL] Eliminate unneeded Sort in Optimizer
mgaido91 Apr 13, 2018
558f31b
[SPARK-23963][SQL] Properly handle large number of columns in query o…
bersprockets Apr 13, 2018
cbb41a0
[SPARK-23966][SS] Refactoring all checkpoint file writing logic in a …
tdas Apr 13, 2018
73f2853
[SPARK-23979][SQL] MultiAlias should not be a CodegenFallback
viirya Apr 14, 2018
c096493
[SPARK-23956][YARN] Use effective RPC port in AM registration
gerashegalov Apr 16, 2018
6931022
[SPARK-23917][SQL] Add array_max function
mgaido91 Apr 16, 2018
083cf22
[SPARK-21033][CORE][FOLLOW-UP] Update Spillable
wangyum Apr 16, 2018
5003736
[SPARK-9312][ML] Add RawPrediction, numClasses, and numFeatures for O…
lu-wang-dl Apr 16, 2018
0461482
[SPARK-21088][ML] CrossValidator, TrainValidationSplit support collec…
WeichenXu123 Apr 16, 2018
fd990a9
[SPARK-23873][SQL] Use accessors in interpreted LambdaVariable
viirya Apr 16, 2018
14844a6
[SPARK-23918][SQL] Add array_min function
mgaido91 Apr 17, 2018
1cc66a0
[SPARK-23687][SS] Add a memory source for continuous processing.
jose-torres Apr 17, 2018
05ae747
[SPARK-23747][STRUCTURED STREAMING] Add EpochCoordinator unit tests
Apr 17, 2018
30ffb53
[SPARK-23875][SQL] Add IndexedSeq wrapper for ArrayData
viirya Apr 17, 2018
0a9172a
[SPARK-23835][SQL] Add not-null check to Tuples' arguments deserializ…
mgaido91 Apr 17, 2018
ed4101d
[SPARK-22676] Avoid iterating all partition paths when spark.sql.hive…
Apr 17, 2018
3990daa
[SPARK-23948] Trigger mapstage's job listener in submitMissingTasks
Apr 17, 2018
f39e82c
[SPARK-23986][SQL] freshName can generate non-unique names
mgaido91 Apr 17, 2018
1ca3c50
[SPARK-21741][ML][PYSPARK] Python API for DataFrame-based multivariat…
WeichenXu123 Apr 17, 2018
5fccdae
[SPARK-22968][DSTREAM] Throw an exception on partition revoking issue
jerryshao Apr 18, 2018
1e3b876
[SPARK-21479][SQL] Outer join filter pushdown in null supplying table…
maryannxue Apr 18, 2018
310a8cd
[SPARK-23341][SQL] define some standard options for data source v2
cloud-fan Apr 18, 2018
cce4694
[SPARK-24002][SQL] Task not serializable caused by org.apache.parquet…
gatorsmile Apr 18, 2018
f81fa47
[SPARK-23926][SQL] Extending reverse function to support ArrayType ar…
Apr 18, 2018
f09a9e9
[SPARK-24007][SQL] EqualNullSafe for FloatType and DoubleType might g…
ueshin Apr 18, 2018
a906647
[SPARK-23875][SQL][FOLLOWUP] Add IndexedSeq wrapper for ArrayData
viirya Apr 18, 2018
0c94e48
[SPARK-23775][TEST] Make DataFrameRangeSuite not flaky
gaborgsomogyi Apr 18, 2018
8bb0df2
[SPARK-24014][PYSPARK] Add onStreamingStarted method to StreamingList…
viirya Apr 19, 2018
d5bec48
[SPARK-23919][SQL] Add array_position function
kiszk Apr 19, 2018
46bb2b5
[SPARK-23924][SQL] Add element_at function
kiszk Apr 19, 2018
1b08c43
[SPARK-23584][SQL] NewInstance should support interpreted execution
maropu Apr 19, 2018
e134165
[SPARK-23588][SQL] CatalystToExternalMap should support interpreted e…
maropu Apr 19, 2018
9e10f69
[SPARK-22676][FOLLOW-UP] fix code style for test.
Apr 19, 2018
d96c3e3
[SPARK-21811][SQL] Fix the inconsistency behavior when finding the wi…
jiangxb1987 Apr 19, 2018
0deaa52
[SPARK-24021][CORE] fix bug in BlacklistTracker's updateBlacklistForF…
Ngone51 Apr 19, 2018
6e19f76
[SPARK-23989][SQL] exchange should copy data before non-serialized sh…
cloud-fan Apr 19, 2018
a471880
[SPARK-24026][ML] Add Power Iteration Clustering to spark.ml
wangmiao1981 Apr 19, 2018
9ea8d3d
[SPARK-22362][SQL] Add unit test for Window Aggregate Functions
attilapiros Apr 19, 2018
e55953b
[SPARK-24022][TEST] Make SparkContextSuite not flaky
gaborgsomogyi Apr 19, 2018
b3fde5a
[SPARK-23877][SQL] Use filter predicates to prune partitions in metad…
rdblue Apr 20, 2018
e6b4660
[SPARK-23736][SQL] Extending the concat function to support array col…
Apr 20, 2018
074a7f9
[SPARK-23588][SQL][FOLLOW-UP] Resolve a map builder method per execut…
maropu Apr 20, 2018
0dd97f6
[SPARK-23595][SQL] ValidateExternalType should support interpreted ex…
maropu Apr 20, 2018
1d758dc
Revert "[SPARK-23775][TEST] Make DataFrameRangeSuite not flaky"
Apr 20, 2018
32b4bcd
[SPARK-24029][CORE] Set SO_REUSEADDR on listen sockets.
Apr 21, 2018
7bc853d
[SPARK-24033][SQL] Fix Mismatched of Window Frame specifiedwindowfram…
gatorsmile Apr 21, 2018
c48085a
[SPARK-23799][SQL] FilterEstimation.evaluateInSet produces devision b…
Apr 22, 2018
c3a86fa
[SPARK-10399][SPARK-23879][FOLLOWUP][CORE] Free unused off-heap memor…
kiszk Apr 23, 2018
f70f46d
[SPARK-23877][SQL][FOLLOWUP] use PhysicalOperation to simplify the ha…
cloud-fan Apr 23, 2018
d87d30e
[SPARK-23564][SQL] infer additional filters from constraints for join…
cloud-fan Apr 23, 2018
afbdf42
[SPARK-23589][SQL] ExternalMapToCatalyst should support interpreted e…
maropu Apr 23, 2018
293a0f2
[Spark-24024][ML] Fix poisson deviance calculations in GLM to handle …
tengpeng Apr 23, 2018
448d248
[SPARK-21168] KafkaRDD should always set kafka clientId.
liu-zhaokun Apr 23, 2018
770add8
[SPARK-23004][SS] Ensure StateStore.commit is called only once in a s…
tdas Apr 23, 2018
e82cb68
[SPARK-11237][ML] Add pmml export for k-means in Spark ML
holdenk Apr 23, 2018
c8f3ac6
[SPARK-23888][CORE] correct the comment of hasAttemptOnHost()
Ngone51 Apr 23, 2018
428b903
[SPARK-24029][CORE] Follow up: set SO_REUSEADDR on the server socket.
Apr 24, 2018
281c1ca
[SPARK-23973][SQL] Remove consecutive Sorts
mgaido91 Apr 24, 2018
c303b1b
[MINOR][DOCS] Fix comments of SQLExecution#withExecutionId
seancxmao Apr 24, 2018
87e8a57
[SPARK-24054][R] Add array_position function / element_at functions
HyukjinKwon Apr 24, 2018
4926a7c
[SPARK-23589][SQL][FOLLOW-UP] Reuse InternalRow in ExternalMapToCatal…
maropu Apr 24, 2018
55c4ca8
[SPARK-22683][CORE] Add a executorAllocationRatio parameter to thrott…
jcuquemelle Apr 24, 2018
2a24c48
[SPARK-23975][ML] Allow Clustering to take Arrays of Double as input …
lu-wang-dl Apr 24, 2018
ce7ba2e
[SPARK-23807][BUILD] Add Hadoop 3.1 profile with relevant POM fix ups
steveloughran Apr 24, 2018
8301375
[SPARK-23455][ML] Default Params in ML should be saved separately in …
viirya Apr 24, 2018
379bffa
[SPARK-23990][ML] Instruments logging improvements - ML regression pa…
WeichenXu123 Apr 24, 2018
7b1e652
[SPARK-24056][SS] Make consumer creation lazy in Kafka source for Str…
tdas Apr 24, 2018
d6c26d1
[SPARK-24038][SS] Refactor continuous writing to its own class
jose-torres Apr 25, 2018
5fea17b
[SPARK-23821][SQL] Collection function: flatten
Apr 25, 2018
64e8408
[SPARK-24012][SQL] Union of map and other compatible column
liutang123 Apr 25, 2018
20ca208
[SPARK-23880][SQL] Do not trigger any jobs for caching data
maropu Apr 25, 2018
396938e
[SPARK-24050][SS] Calculate input / processing rates correctly for Da…
tdas Apr 25, 2018
ac4ca7c
[SPARK-24012][SQL][TEST][FOLLOWUP] add unit test
cloud-fan Apr 25, 2018
95a6513
[SPARK-24069][R] Add array_min / array_max functions
HyukjinKwon Apr 26, 2018
3f1e999
[SPARK-23849][SQL] Tests for samplingRatio of json datasource
MaxGekk Apr 26, 2018
58c55cb
[SPARK-23902][SQL] Add roundOff flag to months_between
mgaido91 Apr 26, 2018
cd10f9d
[SPARK-23916][SQL] Add array_join function
mgaido91 Apr 26, 2018
ffaf0f9
[SPARK-24062][THRIFT SERVER] Fix SASL encryption cannot enabled issue…
jerryshao Apr 26, 2018
d1eb8d3
[SPARK-24094][SS][MINOR] Change description strings of v2 streaming s…
tdas Apr 26, 2018
ce2f919
[SPARK-23799][SQL][FOLLOW-UP] FilterEstimation.evaluateInSet produces…
gatorsmile Apr 26, 2018
4f1e386
[SPARK-24057][PYTHON] put the real data type in the AssertionError me…
huaxingao Apr 26, 2018
f7435be
[SPARK-24044][PYTHON] Explicitly print out skipped tests from unittes…
HyukjinKwon Apr 26, 2018
9ee9fcf
[SPARK-24083][YARN] Log stacktrace for uncaught exception
caneGuy Apr 26, 2018
8aa1d7b
[SPARK-23355][SQL] convertMetastore should not ignore table properties
dongjoon-hyun Apr 27, 2018
109935f
[SPARK-23830][YARN] added check to ensure main method is found
Apr 27, 2018
2824f12
[SPARK-23565][SS] New error message for structured streaming sources …
patrickmcgloin Apr 27, 2018
3fd297a
[SPARK-24085][SQL] Query returns UnsupportedOperationException when s…
dilipbiswal Apr 27, 2018
8614edd
[SPARK-24104] SQLAppStatusListener overwrites metrics onDriverAccumUp…
juliuszsompolski Apr 27, 2018
1fb46f3
[SPARK-23688][SS] Refactor tests away from rate source
HeartSaVioR Apr 28, 2018
ad94e85
[SPARK-23736][SQL][FOLLOWUP] Error message should contains SQL types
mgaido91 Apr 28, 2018
4df5136
[SPARK-22732][SS][FOLLOW-UP] Fix MemorySinkV2 toString error
wangyum Apr 28, 2018
bd14da6
[SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support custom encoding …
MaxGekk Apr 29, 2018
56f501e
[MINOR][DOCS] Fix a broken link for Arrow's supported types in the pr…
HyukjinKwon Apr 30, 2018
3121b41
[SPARK-23846][SQL] The samplingRatio option for CSV datasource
MaxGekk Apr 30, 2018
b42ad16
[SPARK-24072][SQL] clearly define pushed filters
cloud-fan Apr 30, 2018
007ae68
[SPARK-24003][CORE] Add support to provide spark.executor.extraJavaOp…
Apr 30, 2018
b857fb5
[SPARK-23853][PYSPARK][TEST] Run Hive-related PySpark tests only for …
dongjoon-hyun May 1, 2018
7bbec0d
[SPARK-24061][SS] Add TypedFilter support for continuous processing
May 1, 2018
6782359
[SPARK-23941][MESOS] Mesos task failed on specific spark app name
BounkongK May 1, 2018
e15850b
[SPARK-24131][PYSPARK] Add majorMinorVersion API to PySpark for deter…
viirya May 2, 2018
9215ee7
[SPARK-23976][CORE] Detect length overflow in UTF8String.concat()/Byt…
kiszk May 2, 2018
152eaf6
[SPARK-24107][CORE] ChunkedByteBuffer.writeFully method has not reset…
May 2, 2018
8dbf56c
[SPARK-24013][SQL] Remove unneeded compress in ApproximatePercentile
mgaido91 May 2, 2018
8bd2702
[SPARK-24133][SQL] Check for integer overflows when resizing Writable…
ala May 2, 2018
504c9cf
[SPARK-24123][SQL] Fix precision issues in monthsBetween with more th…
mgaido91 May 2, 2018
5be8aab
[SPARK-23923][SQL] Add cardinality function
kiszk May 2, 2018
e4c91c0
[SPARK-24111][SQL] Add the TPCDS v2.7 (latest) queries in TPCDSQueryB…
maropu May 2, 2018
bf4352c
[SPARK-24110][THRIFT-SERVER] Avoid UGI.loginUserFromKeytab in STS
jerryshao May 3, 2018
c9bfd1c
[SPARK-23489][SQL][TEST] HiveExternalCatalogVersionsSuite should veri…
dongjoon-hyun May 3, 2018
417ad92
[SPARK-23715][SQL] the input of to/from_utc_timestamp can not have ti…
cloud-fan May 3, 2018
991b526
[SPARK-24166][SQL] InMemoryTableScanExec should not access SQLConf at…
cloud-fan May 3, 2018
96a5001
[SPARK-24169][SQL] JsonToStructs should not access SQLConf at executo…
cloud-fan May 3, 2018
94641fe
[SPARK-23433][CORE] Late zombie task completions update all tasksets
squito May 3, 2018
e3201e1
[SPARK-24035][SQL] SQL syntax for Pivot
maryannxue May 4, 2018
e646ae6
[SPARK-24168][SQL] WindowExec should not access SQLConf at executor side
cloud-fan May 4, 2018
0c23e25
[SPARK-24167][SQL] ParquetFilters should not access SQLConf at execut…
cloud-fan May 4, 2018
7f1b6b1
[SPARK-24136][SS] Fix MemoryStreamDataReader.next to skip sleeping if…
arunmahadevan May 4, 2018
4d5de4d
[SPARK-23697][CORE] LegacyAccumulatorWrapper should define isZero cor…
cloud-fan May 4, 2018
d04806a
[SPARK-24124] Spark history server should create spark.history.store.…
May 4, 2018
af4dc50
[SPARK-24039][SS] Do continuous processing writes with multiple compu…
jose-torres May 4, 2018
47b5b68
[SPARK-24157][SS] Enabled no-data batches in MicroBatchExecution for …
tdas May 4, 2018
dd4b1b9
[SPARK-24185][SPARKR][SQL] add flatten function to SparkR
huaxingao May 6, 2018
f38ea00
[SPARK-24017][SQL] Refactor ExternalCatalog to be an interface
gatorsmile May 7, 2018
a634d66
[SPARK-24126][PYSPARK] Use build-specific temp directory for pyspark …
May 7, 2018
889f6cc
[SPARK-24143] filter empty blocks when convert mapstatus to (blockId,…
May 7, 2018
7564a9a
[SPARK-23921][SQL] Add array_sort function
kiszk May 7, 2018
d2aa859
[SPARK-24160] ShuffleBlockFetcherIterator should fail if it receives …
JoshRosen May 7, 2018
c598197
[SPARK-23775][TEST] Make DataFrameRangeSuite not flaky
gaborgsomogyi May 7, 2018
f065280
[SPARK-24160][FOLLOWUP] Fix compilation failure
mgaido91 May 7, 2018
e35ad3c
[SPARK-23930][SQL] Add slice function
mgaido91 May 7, 2018
4e861db
[SPARK-16406][SQL] Improve performance of LogicalPlan.resolve
hvanhovell May 7, 2018
d83e963
[SPARK-24043][SQL] Interpreted Predicate should initialize nondetermi…
bersprockets May 7, 2018
56a52e0
[SPARK-15750][MLLIB][PYSPARK] Constructing FPGrowth fails when no num…
zjffdu May 7, 2018
1c9c5de
[SPARK-23291][SPARK-23291][R][FOLLOWUP] Update SparkR migration note for
HyukjinKwon May 7, 2018
f48bd6b
[SPARK-22885][ML][TEST] ML test for StructuredStreaming: spark.ml.tuning
WeichenXu123 May 7, 2018
76ecd09
[SPARK-20114][ML] spark.ml parity for sequential pattern mining - Pre…
WeichenXu123 May 7, 2018
0d63eb8
[SPARK-23975][ML] Add support of array input for all clustering methods
lu-wang-dl May 8, 2018
cd12c5c
[SPARK-24128][SQL] Mention configuration option in implicit CROSS JOI…
henryr May 8, 2018
05eb19b
[SPARK-24188][CORE] Restore "/version" API endpoint.
May 8, 2018
e17567c
[SPARK-24076][SQL] Use different seed in HashAggregate to avoid hash …
yucai May 8, 2018
b54bbe5
[SPARK-24131][PYSPARK][FOLLOWUP] Add majorMinorVersion API to PySpark…
viirya May 8, 2018
2f6fe7d
[SPARK-23094][SPARK-23723][SPARK-23724][SQL][FOLLOW-UP] Support custo…
gatorsmile May 8, 2018
487faf1
[SPARK-24117][SQL] Unified the getSizePerRow
wangyum May 8, 2018
ec7854a
re-raising StopIteration in client code
e-dorigatti May 21, 2018
fddd031
moved safe_iter to util module and more descriptive name
e-dorigatti May 22, 2018
ee54924
removed redundancy from tests
e-dorigatti May 22, 2018
d739eea
improved doc, error message and code style
e-dorigatti May 24, 2018
f0f80ed
improved tests
e-dorigatti May 24, 2018
d59f0d5
fixed style
e-dorigatti May 24, 2018
b0af18e
fixed udf and its test
e-dorigatti May 24, 2018
167a75b
preserving metadata of wrapped function
e-dorigatti May 24, 2018
90b064d
catching relevant exceptions only
e-dorigatti May 24, 2018
75316af
preserving argspecs of wrapped function
e-dorigatti May 26, 2018
026ecdd
style
e-dorigatti May 26, 2018
f7b53c2
saving argspec in udf
e-dorigatti May 29, 2018
8fac2a8
saving signature only for pandas udf, removed useless try/except
e-dorigatti May 29, 2018
5b5570b
comment explaining hack
e-dorigatti May 30, 2018
[MINOR][DOC] Fix some typos and grammar issues
## What changes were proposed in this pull request?

Easy fix in the documentation.

## How was this patch tested?

N/A

Closes #20948

Author: Daniel Sakuma <[email protected]>

Closes #20928 from dsakuma/fix_typo_configuration_docs.
dsakuma authored and HyukjinKwon committed Apr 6, 2018
commit 6ade5cbb498f6c6ea38779b97f2325d5cf5013f2
2 changes: 1 addition & 1 deletion docs/README.md
@@ -5,7 +5,7 @@ here with the Spark source code. You can also find documentation specific to rel
Spark at http://spark.apache.org/documentation.html.

Read on to learn more about viewing documentation in plain text (i.e., markdown) or building the
documentation yourself. Why build it yourself? So that you have the docs that corresponds to
documentation yourself. Why build it yourself? So that you have the docs that correspond to
whichever version of Spark you currently have checked out of revision control.

## Prerequisites
2 changes: 1 addition & 1 deletion docs/_plugins/include_example.rb
@@ -48,7 +48,7 @@ def render(context)
begin
code = File.open(@file).read.encode("UTF-8")
rescue => e
# We need to explicitly exit on execptions here because Jekyll will silently swallow
# We need to explicitly exit on exceptions here because Jekyll will silently swallow
# them, leading to silent build failures (see https://github.com/jekyll/jekyll/issues/5104)
puts(e)
puts(e.backtrace)
2 changes: 1 addition & 1 deletion docs/building-spark.md
@@ -113,7 +113,7 @@ Note: Flume support is deprecated as of Spark 2.3.0.

## Building submodules individually

It's possible to build Spark sub-modules using the `mvn -pl` option.
It's possible to build Spark submodules using the `mvn -pl` option.

For instance, you can build the Spark Streaming module using:

4 changes: 2 additions & 2 deletions docs/cloud-integration.md
@@ -27,13 +27,13 @@ description: Introduction to cloud storage support in Apache Spark SPARK_VERSION
All major cloud providers offer persistent data storage in *object stores*.
These are not classic "POSIX" file systems.
In order to store hundreds of petabytes of data without any single points of failure,
object stores replace the classic filesystem directory tree
object stores replace the classic file system directory tree
with a simpler model of `object-name => data`. To enable remote access, operations
on objects are usually offered as (slow) HTTP REST operations.

Spark can read and write data in object stores through filesystem connectors implemented
in Hadoop or provided by the infrastructure suppliers themselves.
These connectors make the object stores look *almost* like filesystems, with directories and files
These connectors make the object stores look *almost* like file systems, with directories and files
and the classic operations on them such as list, delete and rename.


20 changes: 10 additions & 10 deletions docs/configuration.md
@@ -558,7 +558,7 @@ Apart from these, the following properties are also available, and may be useful
<td>
This configuration limits the number of remote requests to fetch blocks at any given point.
When the number of hosts in the cluster increase, it might lead to very large number
of in-bound connections to one or more nodes, causing the workers to fail under load.
of inbound connections to one or more nodes, causing the workers to fail under load.
By allowing it to limit the number of fetch requests, this scenario can be mitigated.
</td>
</tr>
@@ -1288,7 +1288,7 @@ Apart from these, the following properties are also available, and may be useful
<td>4194304 (4 MB)</td>
<td>
The estimated cost to open a file, measured by the number of bytes could be scanned at the same
time. This is used when putting multiple files into a partition. It is better to over estimate,
time. This is used when putting multiple files into a partition. It is better to overestimate,
then the partitions with small files will be faster than partitions with bigger files.
</td>
</tr>
@@ -1513,7 +1513,7 @@ Apart from these, the following properties are also available, and may be useful
<td>0.8 for KUBERNETES mode; 0.8 for YARN mode; 0.0 for standalone mode and Mesos coarse-grained mode</td>
<td>
The minimum ratio of registered resources (registered resources / total expected resources)
(resources are executors in yarn mode and Kubernetes mode, CPU cores in standalone mode and Mesos coarsed-grained
(resources are executors in yarn mode and Kubernetes mode, CPU cores in standalone mode and Mesos coarse-grained
mode ['spark.cores.max' value is total expected resources for Mesos coarse-grained mode] )
to wait for before scheduling begins. Specified as a double between 0.0 and 1.0.
Regardless of whether the minimum ratio of resources has been reached,
@@ -1634,7 +1634,7 @@ Apart from these, the following properties are also available, and may be useful
<td>false</td>
<td>
(Experimental) If set to "true", Spark will blacklist the executor immediately when a fetch
failure happenes. If external shuffle service is enabled, then the whole node will be
failure happens. If external shuffle service is enabled, then the whole node will be
blacklisted.
</td>
</tr>
@@ -1722,7 +1722,7 @@ Apart from these, the following properties are also available, and may be useful
When <code>spark.task.reaper.enabled = true</code>, this setting specifies a timeout after
which the executor JVM will kill itself if a killed task has not stopped running. The default
value, -1, disables this mechanism and prevents the executor from self-destructing. The purpose
of this setting is to act as a safety-net to prevent runaway uncancellable tasks from rendering
of this setting is to act as a safety-net to prevent runaway noncancellable tasks from rendering
an executor unusable.
</td>
</tr>
@@ -1915,8 +1915,8 @@ showDF(properties, numRows = 200, truncate = FALSE)
<td><code>spark.streaming.receiver.writeAheadLog.enable</code></td>
<td>false</td>
<td>
Enable write ahead logs for receivers. All the input data received through receivers
will be saved to write ahead logs that will allow it to be recovered after driver failures.
Enable write-ahead logs for receivers. All the input data received through receivers
will be saved to write-ahead logs that will allow it to be recovered after driver failures.
See the <a href="streaming-programming-guide.html#deploying-applications">deployment guide</a>
in the Spark Streaming programing guide for more details.
</td>
@@ -1971,7 +1971,7 @@ showDF(properties, numRows = 200, truncate = FALSE)
<td><code>spark.streaming.driver.writeAheadLog.closeFileAfterWrite</code></td>
<td>false</td>
<td>
Whether to close the file after writing a write ahead log record on the driver. Set this to 'true'
Whether to close the file after writing a write-ahead log record on the driver. Set this to 'true'
when you want to use S3 (or any file system that does not support flushing) for the metadata WAL
on the driver.
</td>
@@ -1980,7 +1980,7 @@ showDF(properties, numRows = 200, truncate = FALSE)
<td><code>spark.streaming.receiver.writeAheadLog.closeFileAfterWrite</code></td>
<td>false</td>
<td>
Whether to close the file after writing a write ahead log record on the receivers. Set this to 'true'
Whether to close the file after writing a write-ahead log record on the receivers. Set this to 'true'
when you want to use S3 (or any file system that does not support flushing) for the data WAL
on the receivers.
</td>
@@ -2178,7 +2178,7 @@ Spark's classpath for each application. In a Spark cluster running on YARN, thes
files are set cluster-wide, and cannot safely be changed by the application.

The better choice is to use spark hadoop properties in the form of `spark.hadoop.*`.
They can be considered as same as normal spark properties which can be set in `$SPARK_HOME/conf/spark-defalut.conf`
They can be considered as same as normal spark properties which can be set in `$SPARK_HOME/conf/spark-default.conf`

In some cases, you may want to avoid hard-coding certain configurations in a `SparkConf`. For
instance, Spark allows you to simply create an empty conf and set spark/spark hadoop properties.
2 changes: 1 addition & 1 deletion docs/css/pygments-default.css
@@ -5,7 +5,7 @@ To generate this, I had to run
But first I had to install pygments via easy_install pygments

I had to override the conflicting bootstrap style rules by linking to
this stylesheet lower in the html than the bootstap css.
this stylesheet lower in the html than the bootstrap css.

Also, I was thrown off for a while at first when I was using markdown
code block inside my {% highlight scala %} ... {% endhighlight %} tags
4 changes: 2 additions & 2 deletions docs/graphx-programming-guide.md
@@ -491,7 +491,7 @@ val joinedGraph = graph.joinVertices(uniqueCosts)(
The more general [`outerJoinVertices`][Graph.outerJoinVertices] behaves similarly to `joinVertices`
except that the user defined `map` function is applied to all vertices and can change the vertex
property type. Because not all vertices may have a matching value in the input RDD the `map`
function takes an `Option` type. For example, we can setup a graph for PageRank by initializing
function takes an `Option` type. For example, we can set up a graph for PageRank by initializing
vertex properties with their `outDegree`.


@@ -969,7 +969,7 @@ A vertex is part of a triangle when it has two adjacent vertices with an edge be
# Examples

Suppose I want to build a graph from some text files, restrict the graph
to important relationships and users, run page-rank on the sub-graph, and
to important relationships and users, run page-rank on the subgraph, and
then finally return attributes associated with the top users. I can do
all of this in just a few lines with GraphX:

4 changes: 2 additions & 2 deletions docs/job-scheduling.md
@@ -23,7 +23,7 @@ run tasks and store data for that application. If multiple users need to share y
different options to manage allocation, depending on the cluster manager.

The simplest option, available on all cluster managers, is _static partitioning_ of resources. With
this approach, each application is given a maximum amount of resources it can use, and holds onto them
this approach, each application is given a maximum amount of resources it can use and holds onto them
for its whole duration. This is the approach used in Spark's [standalone](spark-standalone.html)
and [YARN](running-on-yarn.html) modes, as well as the
[coarse-grained Mesos mode](running-on-mesos.html#mesos-run-modes).
@@ -230,7 +230,7 @@ properties:
* `minShare`: Apart from an overall weight, each pool can be given a _minimum shares_ (as a number of
CPU cores) that the administrator would like it to have. The fair scheduler always attempts to meet
all active pools' minimum shares before redistributing extra resources according to the weights.
The `minShare` property can therefore be another way to ensure that a pool can always get up to a
The `minShare` property can, therefore, be another way to ensure that a pool can always get up to a
certain number of resources (e.g. 10 cores) quickly without giving it a high priority for the rest
of the cluster. By default, each pool's `minShare` is 0.

2 changes: 1 addition & 1 deletion docs/ml-advanced.md
@@ -77,7 +77,7 @@ Quasi-Newton methods in this case. This fallback is currently always enabled for
L1 regularization is applied (i.e. $\alpha = 0$), there exists an analytical solution and either Cholesky or Quasi-Newton solver may be used. When $\alpha > 0$ no analytical
solution exists and we instead use the Quasi-Newton solver to find the coefficients iteratively.

In order to make the normal equation approach efficient, `WeightedLeastSquares` requires that the number of features be no more than 4096. For larger problems, use L-BFGS instead.
In order to make the normal equation approach efficient, `WeightedLeastSquares` requires that the number of features is no more than 4096. For larger problems, use L-BFGS instead.

## Iteratively reweighted least squares (IRLS)

6 changes: 3 additions & 3 deletions docs/ml-classification-regression.md
@@ -420,7 +420,7 @@ Refer to the [R API docs](api/R/spark.svmLinear.html) for more details.

[OneVsRest](http://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest) is an example of a machine learning reduction for performing multiclass classification given a base classifier that can perform binary classification efficiently. It is also known as "One-vs-All."

`OneVsRest` is implemented as an `Estimator`. For the base classifier it takes instances of `Classifier` and creates a binary classification problem for each of the k classes. The classifier for class i is trained to predict whether the label is i or not, distinguishing class i from all other classes.
`OneVsRest` is implemented as an `Estimator`. For the base classifier, it takes instances of `Classifier` and creates a binary classification problem for each of the k classes. The classifier for class i is trained to predict whether the label is i or not, distinguishing class i from all other classes.

Predictions are done by evaluating each binary classifier and the index of the most confident classifier is output as label.

@@ -908,7 +908,7 @@ Refer to the [R API docs](api/R/spark.survreg.html) for more details.
belongs to the family of regression algorithms. Formally isotonic regression is a problem where
given a finite set of real numbers `$Y = {y_1, y_2, ..., y_n}$` representing observed responses
and `$X = {x_1, x_2, ..., x_n}$` the unknown response values to be fitted
finding a function that minimises
finding a function that minimizes

`\begin{equation}
f(x) = \sum_{i=1}^n w_i (y_i - x_i)^2
@@ -927,7 +927,7 @@ We implement a
which uses an approach to
[parallelizing isotonic regression](http://doi.org/10.1007/978-3-642-99789-1_10).
The training input is a DataFrame which contains three columns
label, features and weight. Additionally IsotonicRegression algorithm has one
label, features and weight. Additionally, IsotonicRegression algorithm has one
optional parameter called $isotonic$ defaulting to true.
This argument specifies if the isotonic regression is
isotonic (monotonically increasing) or antitonic (monotonically decreasing).
2 changes: 1 addition & 1 deletion docs/ml-collaborative-filtering.md
@@ -35,7 +35,7 @@ but the ids must be within the integer value range.

### Explicit vs. implicit feedback

The standard approach to matrix factorization based collaborative filtering treats
The standard approach to matrix factorization-based collaborative filtering treats
the entries in the user-item matrix as *explicit* preferences given by the user to the item,
for example, users giving ratings to movies.

2 changes: 1 addition & 1 deletion docs/ml-features.md
@@ -1174,7 +1174,7 @@ for more details on the API.
## SQLTransformer

`SQLTransformer` implements the transformations which are defined by SQL statement.
Currently we only support SQL syntax like `"SELECT ... FROM __THIS__ ..."`
Currently, we only support SQL syntax like `"SELECT ... FROM __THIS__ ..."`
where `"__THIS__"` represents the underlying table of the input dataset.
The select clause specifies the fields, constants, and expressions to display in
the output, and can be any select clause that Spark SQL supports. Users can also
2 changes: 1 addition & 1 deletion docs/ml-migration-guides.md
@@ -347,7 +347,7 @@ rather than using the old parameter class `Strategy`. These new training method
separate classification and regression, and they replace specialized parameter types with
simple `String` types.

Examples of the new, recommended `trainClassifier` and `trainRegressor` are given in the
Examples of the new recommended `trainClassifier` and `trainRegressor` are given in the
[Decision Trees Guide](mllib-decision-tree.html#examples).

## From 0.9 to 1.0
2 changes: 1 addition & 1 deletion docs/ml-tuning.md
@@ -103,7 +103,7 @@ Refer to the [`CrossValidator` Python docs](api/python/pyspark.ml.html#pyspark.m

In addition to `CrossValidator` Spark also offers `TrainValidationSplit` for hyper-parameter tuning.
`TrainValidationSplit` only evaluates each combination of parameters once, as opposed to k times in
the case of `CrossValidator`. It is therefore less expensive,
the case of `CrossValidator`. It is, therefore, less expensive,
but will not produce as reliable results when the training dataset is not sufficiently large.

Unlike `CrossValidator`, `TrainValidationSplit` creates a single (training, test) dataset pair.
2 changes: 1 addition & 1 deletion docs/mllib-clustering.md
@@ -42,7 +42,7 @@ The following code snippets can be executed in `spark-shell`.
In the following example after loading and parsing data, we use the
[`KMeans`](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) object to cluster the data
into two clusters. The number of desired clusters is passed to the algorithm. We then compute Within
Set Sum of Squared Error (WSSSE). You can reduce this error measure by increasing *k*. In fact the
Set Sum of Squared Error (WSSSE). You can reduce this error measure by increasing *k*. In fact, the
optimal *k* is usually one where there is an "elbow" in the WSSSE graph.

Refer to the [`KMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) and [`KMeansModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.KMeansModel) for details on the API.
4 changes: 2 additions & 2 deletions docs/mllib-collaborative-filtering.md
@@ -31,7 +31,7 @@ following parameters:

### Explicit vs. implicit feedback

The standard approach to matrix factorization based collaborative filtering treats
The standard approach to matrix factorization-based collaborative filtering treats
the entries in the user-item matrix as *explicit* preferences given by the user to the item,
for example, users giving ratings to movies.

@@ -60,7 +60,7 @@ best parameter learned from a sampled subset to the full dataset and expect simi
<div class="codetabs">

<div data-lang="scala" markdown="1">
In the following example we load rating data. Each row consists of a user, a product and a rating.
In the following example, we load rating data. Each row consists of a user, a product and a rating.
We use the default [ALS.train()](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$)
method which assumes ratings are explicit. We evaluate the
recommendation model by measuring the Mean Squared Error of rating prediction.
2 changes: 1 addition & 1 deletion docs/mllib-data-types.md
@@ -350,7 +350,7 @@ which is a tuple of `(Int, Int, Matrix)`.
***Note***

The underlying RDDs of a distributed matrix must be deterministic, because we cache the matrix size.
In general the use of non-deterministic RDDs can lead to errors.
In general, the use of non-deterministic RDDs can lead to errors.

### RowMatrix

2 changes: 1 addition & 1 deletion docs/mllib-dimensionality-reduction.md
@@ -91,7 +91,7 @@ The same code applies to `IndexedRowMatrix` if `U` is defined as an

[Principal component analysis (PCA)](http://en.wikipedia.org/wiki/Principal_component_analysis) is a
statistical method to find a rotation such that the first coordinate has the largest variance
possible, and each succeeding coordinate in turn has the largest variance possible. The columns of
possible, and each succeeding coordinate, in turn, has the largest variance possible. The columns of
the rotation matrix are called principal components. PCA is used widely in dimensionality reduction.

`spark.mllib` supports PCA for tall-and-skinny matrices stored in row-oriented format and any Vectors.
2 changes: 1 addition & 1 deletion docs/mllib-evaluation-metrics.md
@@ -13,7 +13,7 @@ of the model on some criteria, which depends on the application and its requirem
suite of metrics for the purpose of evaluating the performance of machine learning models.

Specific machine learning algorithms fall under broader types of machine learning applications like classification,
regression, clustering, etc. Each of these types have well established metrics for performance evaluation and those
regression, clustering, etc. Each of these types have well-established metrics for performance evaluation and those
metrics that are currently available in `spark.mllib` are detailed in this section.

## Classification model evaluation
2 changes: 1 addition & 1 deletion docs/mllib-feature-extraction.md
@@ -105,7 +105,7 @@ p(w_i | w_j ) = \frac{\exp(u_{w_i}^{\top}v_{w_j})}{\sum_{l=1}^{V} \exp(u_l^{\top
\]`
where $V$ is the vocabulary size.

The skip-gram model with softmax is expensive because the cost of computing $\log p(w_i | w_j)$
The skip-gram model with softmax is expensive because the cost of computing $\log p(w_i | w_j)$
is proportional to $V$, which can be easily in order of millions. To speed up training of Word2Vec,
we used hierarchical softmax, which reduced the complexity of computing of $\log p(w_i | w_j)$ to
$O(\log(V))$