---
post_title: Fault Tolerance
menu_order: 100
feature_maturity: stable
enterprise: 'yes'
---

Failures such as host, network, JVM, or application failures can
affect the behavior of three types of Spark components:
- DC/OS Spark Service
- Batch Jobs
- Streaming Jobs

# DC/OS Spark Service

The DC/OS Spark service runs in Marathon and includes the Mesos Cluster
Dispatcher and the Spark History Server. The Dispatcher manages jobs.
The Spark History Server reads event logs from HDFS. If the service
dies, Marathon will restart it, and it will reload data from these
highly available stores.

# Batch Jobs

Batch jobs are resilient to executor failures, but not driver
failures. The Dispatcher will restart a driver if you submit with
`--supervise`.

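As a sketch, a supervised batch submission might look like the following. The jar URL and `--class` value are illustrative placeholders; `dcos spark run` is the DC/OS CLI entry point for submitting jobs to the Dispatcher.

```bash
# Submit a batch job whose driver the Dispatcher will restart on failure.
# The jar URL and --class value are illustrative placeholders.
dcos spark run --submit-args="\
  --supervise \
  --class org.example.MyBatchJob \
  http://example.com/jars/my-batch-job.jar"
```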
## Driver

When the driver fails, executors are terminated, and the entire Spark
application fails. If you submitted your job with `--supervise`, then
the Dispatcher will restart the job.

## Executors

Batch jobs are resilient to executor failure. Upon failure, cached
data, shuffle files, and partially computed RDDs are lost. However,
Spark tracks the lineage of each RDD, which allows it to
recompute this data from the original data source, caches, or shuffle
files. There is a performance cost as data is recomputed, but an
executor failure will not cause a job to fail.

# Streaming Jobs

Whereas batch jobs run once and can usually be restarted upon failure,
streaming jobs often need to run constantly. The application must
therefore survive both driver and executor failures. To consume from
Kafka with these guarantees, you can use the Direct Kafka API.
For exactly once processing semantics, you must use the Direct Kafka
API. All other receivers provide at least once semantics.

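As a sketch, one way a job might pull in the direct Kafka integration is at submission time via `--packages`. The package coordinates and version below are assumptions to adapt to your Spark and Kafka versions, and the class name and jar URL are placeholders.

```bash
# Add the Spark Streaming Kafka integration so the job can use the
# direct (receiver-less) Kafka stream; coordinates are examples only.
dcos spark run --submit-args="\
  --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 \
  --class org.example.MyKafkaStreamingJob \
  http://example.com/jars/my-streaming-job.jar"
```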
## Failures

There are two types of failures:

- Driver
- Executor

## Job Features

There are a few variables that affect the reliability of your job:

- [WAL][1]
- [Receiver reliability][2]
- [Storage level][3]

## Reliability Features

The two reliability features of a job are data loss and processing
semantics. Data loss occurs when the source sends data, but the job
fails to process it. Processing semantics describe how many times a
received message is processed by the job. They can be either "at least
once" or "exactly once."

### Data loss

A Spark job loses data when delivered data does not get processed.
The following is a list of configurations with increasing data
preservation guarantees:
    executor failure => **no data loss**
    driver failure => **no data loss**

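The strongest of these configurations can be sketched as a single submission that combines driver supervision with the receiver write-ahead log, which Spark controls via the `spark.streaming.receiver.writeAheadLog.enable` property. The class name and jar URL below are placeholders.

```bash
# Survive both executor and driver failures without data loss:
# supervise the driver and persist received data to the write-ahead log.
# The jar URL and --class value are illustrative placeholders.
dcos spark run --submit-args="\
  --supervise \
  --conf spark.streaming.receiver.writeAheadLog.enable=true \
  --class org.example.MyStreamingJob \
  http://example.com/jars/my-streaming-job.jar"
```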
### Processing semantics

Processing semantics apply to how many times received messages get
processed. With Spark Streaming, this can be either "at least once"