
Commit 38607d8

Author: Suzanne Scala

Merge pull request apache#84 from mesosphere/ss/auto-update
Ss/auto update

2 parents 864758b + a59deca, commit 38607d8

18 files changed: +815, -741 lines

docs/custom-docker.md

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
---
post_title: Custom Docker Images
menu_order: 95
enterprise: 'yes'
---

**Note:** Custom Docker images are not supported by Mesosphere.

You can customize the Docker image in which Spark runs by extending
the standard Spark Docker image. In this way, you can install your own
libraries, such as a custom Python library.

1. In your Dockerfile, extend from the standard Spark image and add your
   customizations:

   ```
   FROM mesosphere/spark:1.0.4-2.0.1
   RUN apt-get update && apt-get install -y python-pip
   RUN pip install requests
   ```

1. Then, build and push an image from your Dockerfile:

       $ docker build -t username/image:tag .
       $ docker push username/image:tag

1. Reference your custom Docker image with the `--docker-image` option
   when running a Spark job:

       $ dcos spark run --docker-image=username/image:tag --submit-args="http://external.website/mysparkapp.py 30"
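If you want to verify locally that your library actually landed in the image, one quick check (a sketch that assumes Docker is available on your workstation; `--entrypoint` overrides any entrypoint the base image defines) is:

    $ docker run --rm --entrypoint python username/image:tag -c "import requests; print(requests.__version__)"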

docs/fault-tolerance.md

Lines changed: 16 additions & 11 deletions
@@ -1,4 +1,9 @@
-#Fault Tolerance
+---
+post_title: Fault Tolerance
+menu_order: 100
+feature_maturity: stable
+enterprise: 'yes'
+---
 
 Failures such as host, network, JVM, or application failures can
 affect the behavior of three types of Spark components:
@@ -7,7 +12,7 @@ affect the behavior of three types of Spark components:
 - Batch Jobs
 - Streaming Jobs
 
-## DC/OS Spark Service
+# DC/OS Spark Service
 
 The DC/OS Spark service runs in Marathon and includes the Mesos Cluster
 Dispatcher and the Spark History Server. The Dispatcher manages jobs
@@ -16,19 +21,19 @@ The Spark History Server reads event logs from HDFS. If the service
 dies, Marathon will restart it, and it will reload data from these
 highly available stores.
 
-## Batch Jobs
+# Batch Jobs
 
 Batch jobs are resilient to executor failures, but not driver
 failures. The Dispatcher will restart a driver if you submit with
 `--supervise`.
 
-### Driver
+## Driver
 
 When the driver fails, executors are terminated, and the entire Spark
 application fails. If you submitted your job with `--supervise`, then
 the Dispatcher will restart the job.
 
-### Executors
+## Executors
 
 Batch jobs are resilient to executor failure. Upon failure, cached
 data, shuffle files, and partially computed RDDs are lost. However,
@@ -37,7 +42,7 @@ recompute this data from the original data source, caches, or shuffle
 files. There is a performance cost as data is recomputed, but an
 executor failure will not cause a job to fail.
 
-## Streaming Jobs
+# Streaming Jobs
 
 Whereas batch jobs run once and can usually be restarted upon failure,
 streaming jobs often need to run constantly. The application must
@@ -50,30 +55,30 @@ you can use the Direct Kafka API.
 For exactly once processing semantics, you must use the Direct Kafka
 API. All other receivers provide at least once semantics.
 
-### Failures
+## Failures
 
 There are two types of failures:
 
 - Driver
 - Executor
 
-### Job Features
+## Job Features
 
 There are a few variables that affect the reliability of your job:
 
 - [WAL][1]
 - [Receiver reliability][2]
 - [Storage level][3]
 
-### Reliability Features
+## Reliability Features
 
 The two reliability features of a job are data loss and processing
 semantics. Data loss occurs when the source sends data, but the job
 fails to process it. Processing semantics describe how many times a
 received message is processed by the job. It can be either "at least
 once" or "exactly once".
 
-#### Data loss
+### Data loss
 
 A Spark Job loses data when delivered data does not get processed.
 The following is a list of configurations with increasing data
@@ -140,7 +145,7 @@ preservation guarantees:
 executor failure => **no data loss**
 driver failure => **no data loss**
 
-#### Processing semantics
+### Processing semantics
 
 Processing semantics apply to how many times received messages get
 processed. With Spark Streaming, this can be either "at least once"
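To make the `--supervise` behavior described under Batch Jobs concrete, a supervised submission looks roughly like this (a sketch reusing the placeholder class and application URL from the other examples in these docs):

    $ dcos spark run --submit-args="--supervise --class MySampleClass http://external.website/mysparkapp.jar"

The WAL listed under Job Features is controlled by the standard Spark property `spark.streaming.receiver.writeAheadLog.enable`; a minimal sketch of enabling it for a streaming job, passed the same way other properties are passed in these docs, with the same placeholder application:

    $ dcos spark run --submit-args="-Dspark.streaming.receiver.writeAheadLog.enable=true --class MySampleClass http://external.website/mysparkapp.jar"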

docs/hdfs.md

Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
---
post_title: Configure Spark for HDFS
nav_title: HDFS
menu_order: 20
enterprise: 'yes'
---

To configure Spark for a specific HDFS cluster, configure
`hdfs.config-url` to be a URL that serves your `hdfs-site.xml` and
`core-site.xml`. For example:

    {
      "hdfs": {
        "config-url": "http://mydomain.com/hdfs-config"
      }
    }

where `http://mydomain.com/hdfs-config/hdfs-site.xml` and
`http://mydomain.com/hdfs-config/core-site.xml` are valid
URLs. [Learn more][8].

For DC/OS Spark, these configuration files are served at
`http://<hdfs.framework-name>.marathon.mesos:<port>/v1/connection`, where
`<hdfs.framework-name>` is a configuration variable set in the HDFS
package, and `<port>` is the port of its marathon app.
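For example, if the HDFS package keeps its default service name of `hdfs` (an assumption; substitute your own `hdfs.framework-name`), the Spark options could point directly at that endpoint, leaving `<port>` as a placeholder for the Marathon app's port:

    {
      "hdfs": {
        "config-url": "http://hdfs.marathon.mesos:<port>/v1/connection"
      }
    }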
# HDFS Kerberos

You can access external (i.e. non-DC/OS) Kerberos-secured HDFS clusters
from Spark on Mesos.

## HDFS Configuration

Once you've set up a Kerberos-enabled HDFS cluster, configure Spark to
connect to it. See instructions [here](#hdfs).

## Installation

1. A krb5.conf file tells Spark how to connect to your KDC. Base64
   encode this file:

       $ cat krb5.conf | base64

1. Add the following to your JSON configuration file to enable
   Kerberos in Spark:

       {
         "security": {
           "kerberos": {
             "krb5conf": "<base64 encoding>"
           }
         }
       }

1. If you've enabled the history server via `history-server.enabled`,
   you must also configure the principal and keytab for the history
   server. **WARNING**: The keytab contains secrets, so you should
   ensure you have SSL enabled while installing DC/OS Spark.

   Base64 encode your keytab:

       $ cat spark.keytab | base64

   And add the following to your configuration file:

       {
         "history-server": {
           "kerberos": {
             "principal": "spark@REALM",
             "keytab": "<base64 encoding>"
           }
         }
       }

1. Install Spark with your custom configuration, here called
   `options.json` (a combined example follows this list):

       $ dcos package install --options=options.json spark
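Putting the snippets above together, a combined `options.json` might look roughly like the following sketch (it assumes the history server is also enabled, as described in the History Server doc; the base64 values are stand-ins for your own encoded files):

    {
      "security": {
        "kerberos": {
          "krb5conf": "<base64 encoding of krb5.conf>"
        }
      },
      "history-server": {
        "enabled": true,
        "kerberos": {
          "principal": "spark@REALM",
          "keytab": "<base64 encoding of spark.keytab>"
        }
      }
    }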
## Job Submission

To authenticate to a Kerberos KDC, DC/OS Spark supports keytab
files as well as ticket-granting tickets (TGTs).

Keytabs are valid indefinitely, while tickets can expire. Especially for
long-running streaming jobs, keytabs are recommended.

### Keytab Authentication

Submit the job with the keytab:

    $ dcos spark run --submit-args="--principal user@REALM --keytab <keytab-file-path>..."

### TGT Authentication

Submit the job with the ticket:

    $ dcos spark run --principal user@REALM --tgt <ticket-file-path>

**Note:** These credentials are security-critical. We highly
recommend configuring SSL encryption between the Spark
components when accessing Kerberos-secured HDFS clusters. See the Security section for information on how to do this.

[8]: http://spark.apache.org/docs/latest/configuration.html#inheriting-hadoop-cluster-configuration

docs/history-server.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
---
post_title: History Server
menu_order: 30
enterprise: 'yes'
---

DC/OS Spark includes the [Spark history server][3]. Because the history
server requires HDFS, you must explicitly enable it.

1. Install HDFS first:

       $ dcos package install hdfs

   **Note:** HDFS requires 5 private nodes.

1. Create a history HDFS directory (default is `/history`). [SSH into
   your cluster][10] and run:

       $ hdfs dfs -mkdir /history

1. Enable the history server when you install Spark. Create a JSON
   configuration file. Here we call it `options.json`:

       {
         "history-server": {
           "enabled": true
         }
       }

1. Install Spark:

       $ dcos package install spark --options=options.json

1. Run jobs with the event log enabled:

       $ dcos spark run --submit-args="-Dspark.eventLog.enabled=true -Dspark.eventLog.dir=hdfs://hdfs/history ... --class MySampleClass http://external.website/mysparkapp.jar"

1. Visit your job in the dispatcher at
   `http://<dcos_url>/service/spark/Dispatcher/`. It will include a link
   to the history server entry for that job.

[3]: http://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact
[10]: https://docs.mesosphere.com/1.8/administration/access-node/sshcluster/

docs/index.md

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
---
post_title: Spark
menu_order: 110
enterprise: 'yes'
---

Apache Spark is a fast and general-purpose cluster computing system for big
data. It provides high-level APIs in Scala, Java, Python, and R, and
an optimized engine that supports general computation graphs for data
analysis. It also supports a rich set of higher-level tools, including
Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX
for graph processing, and Spark Streaming for stream processing. For
more information, see the [Apache Spark documentation][1].

DC/OS Apache Spark consists of
[Apache Spark with a few custom commits][17]
along with
[DC/OS-specific packaging][18].

DC/OS Spark includes:

* [Mesos Cluster Dispatcher][2]
* [Spark History Server][3]
* DC/OS Spark CLI
* Interactive Spark shell

# Benefits

* Utilization: DC/OS Spark leverages Mesos to run Spark on the same
  cluster as other DC/OS services
* Improved efficiency
* Simple management
* Multi-team support
* Interactive analytics through notebooks
* UI integration
* Security

# Features

* Multiversion support
* Run multiple Spark dispatchers
* Run against multiple HDFS clusters
* Backports of scheduling improvements
* Simple installation of all Spark components, including the
  dispatcher and the history server
* Integration of the dispatcher and history server
* Zeppelin integration
* Kerberos and SSL support

# Related Services

* [HDFS][4]
* [Kafka][5]
* [Zeppelin][6]

[1]: http://spark.apache.org/documentation.html
[2]: http://spark.apache.org/docs/latest/running-on-mesos.html#cluster-mode
[3]: http://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact
[4]: https://docs.mesosphere.com/1.8/usage/service-guides/hdfs/
[5]: https://docs.mesosphere.com/1.8/usage/service-guides/kafka/
[6]: https://zeppelin.incubator.apache.org/
[17]: https://github.com/mesosphere/spark
[18]: https://github.com/mesosphere/spark-build
