Start Docker Environment
- **MacOS + Linux ONLY:** Navigate your browser to http://advancedspark.com/keys/pipeline-training-gce.pem
- Create the `~/.ssh/` directory if it doesn't already exist
```
mkdir -p ~/.ssh
```
- Copy the downloaded file to the `~/.ssh` directory
```
cp ~/Downloads/pipeline-training-gce.pem ~/.ssh
```
- Change the permissions on the `.pem` file so that `ssh` doesn't complain
```
chmod 600 ~/.ssh/pipeline-training-gce.pem
```
- **Windows ONLY:** Navigate your browser to http://advancedspark.com/keys/pipeline-training-gce.ppk
- Download the `.ppk` file to a well-known directory

SSH into your cloud instance using the following credentials:

```
Username: pipeline-training
Password: password9
```

- **MacOS + Linux ONLY:** SSH in from a terminal
```
ssh -i ~/.ssh/pipeline-training-gce.pem pipeline-training@<your-cloud-ip>
```
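Optionally, you can add a host alias to `~/.ssh/config` so you don't have to pass the key path every time. This is just a convenience sketch, not part of the official setup; the alias name `pipeline` is made up here:

```
# ~/.ssh/config -- hypothetical alias; substitute your actual cloud IP
Host pipeline
    HostName <your-cloud-ip>
    User pipeline-training
    IdentityFile ~/.ssh/pipeline-training-gce.pem
```

With that in place, `ssh pipeline` is equivalent to the full command above.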
- **Windows ONLY:** SSH in using PuTTY
  - Download PuTTY if you don't already have it
  - Enter `pipeline-training@<your-cloud-ip>` in the `Host Name` text box
  - Select the location of the `.ppk` file under `Connection -> SSH -> Auth` and click `Open`
  - Select `Yes` to accept the scary security alert
  - Type `password9` for the passphrase to avoid the scary, confusing error message
- You're in!
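If you prefer a command line on Windows, the PuTTY suite also ships with `plink`. A minimal sketch, assuming the `.ppk` file is in the current directory:

```
plink -i pipeline-training-gce.ppk pipeline-training@<your-cloud-ip>
```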
- Run the following to verify that you have the latest `fluxcapacitor/pipeline` Docker image
```
sudo docker images

### EXAMPLE OUTPUT ###
...
REPOSITORY               TAG      IMAGE ID       CREATED      SIZE
fluxcapacitor/pipeline   latest   c392786d2afc   3 mins ago   13.17 GB
```
- If you don't see the `fluxcapacitor/pipeline` Docker image listed, you will need to run `sudo docker pull fluxcapacitor/pipeline`
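If you want to script that check, a small sketch (not part of the official walkthrough):

```
# Pull the image only if it isn't already present locally
if ! sudo docker images | grep -q '^fluxcapacitor/pipeline'; then
  sudo docker pull fluxcapacitor/pipeline
fi
```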
Advanced notes for those trying this at home:

- At this point, you should be SSH'd into your specific cloud instance
- Verify that you see the `pipeline-training@<something>` prompt
- The cloud instance should be a fresh instance with no external processes running or bound to any ports
- The Docker container will bind to many ports including port 80, so make sure even `apache2` is disabled (a quick check is sketched below)
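One way to check for port conflicts before starting the container. The `ss` utility ships with most modern Linux distributions; the `apache2` stop command assumes an Ubuntu-style init:

```
# List all listening TCP ports and the processes bound to them
sudo ss -ltnp

# If apache2 (or anything else) holds port 80, stop it
sudo service apache2 stop
```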
- Run the following command
```
sudo docker run -itd --privileged --name pipeline --net=host -m 50g -e "SPRING_PROFILES_ACTIVE=local" fluxcapacitor/pipeline

### EXAMPLE OUTPUT ###
WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
...
```
(Ignore this ^^^ WARNING ^^^)
Advanced notes for those trying this at home:

- You may need to adjust the `-m 50g` memory limit if you're not on a cloud instance with 50+ GB of RAM
- We highly recommend 50+ GB of RAM
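To see how much RAM your instance actually has before picking a value for `-m`, a minimal check:

```
# Show total, used, and free memory in gigabytes
free -g
```

On, say, a 20 GB instance you might pass `-m 16g` instead (an illustrative number, not a tested recommendation).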
- Run the following command to enter the Docker container
```
sudo docker exec -it pipeline bash
```
- Verify that you see the `root@<something>:~/pipeline#` prompt
- Note: At this point, you are inside the Docker container
- Run the following command
```
cd $PIPELINE_HOME && git pull && source $CONFIG_HOME/bash/pipeline.bashrc && $SCRIPTS_HOME/setup/RUNME_ONCE.sh
```
- Wait a few minutes for initialization to complete... Ignore all errors!! This may take some time.
- Run `jps -l` and verify that most of these services are running
```
jps -l

### EXAMPLE OUTPUT ###
...
737 org.elasticsearch.bootstrap.Elasticsearch <-- ElasticSearch
738 org.jruby.Main <-- Logstash
1987 org.apache.zeppelin.server.ZeppelinServer <-- Zeppelin
2243 org.apache.spark.deploy.worker.Worker <-- Spark Worker
2123 org.apache.spark.deploy.master.Master <-- Spark Master
3479 sun.tools.jps.Jps <-- this (jps)
1529 org.apache.zookeeper.server.quorum.QuorumPeerMain <-- ZooKeeper
1973 io.confluent.support.metrics.SupportedKafka <-- Kafka
2555 io.confluent.kafka.schemaregistry.rest.SchemaRegistryMain <-- Kafka SchemaRegistry
3408 io.confluent.kafkarest.Main <-- Kafka REST API
6107 org.apache.flink.runtime.jobmanager.JobManager <-- Flink Service
2547 org.apache.hadoop.util.RunJar <-- Hive Metastore Service (Uses MySQL as backing store)
2908 com.facebook.presto.server.PrestoServer <-- Presto Server
...
```
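If the list is long, you can check just the handful of main classes you care about. A quick sketch; the class-name fragments below are taken from the example output above:

```
# Confirm the core services are up (case-insensitive match on the main class)
for svc in Elasticsearch ZeppelinServer QuorumPeerMain Kafka PrestoServer; do
  jps -l | grep -qi "$svc" && echo "OK: $svc" || echo "MISSING: $svc"
done
```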
- Run `export` and verify that most of these environment variables are set
```
export

### EXAMPLE OUTPUT ###
...
declare -x PIPELINE_HOME="/root/pipeline"
...
declare -x MYSQL_CONNECTOR_JAR="/usr/share/java/mysql-connector-java.jar"
...
declare -x SPRING_PROFILES_ACTIVE="local"
```
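To cut the output down to just the relevant variables, one option (the grep pattern is an assumption based on the variable names above):

```
# Show only the pipeline-related environment variables
export | grep -E 'PIPELINE|SPRING|MYSQL'
```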
- Run the `start-spark-streaming.sh` script from anywhere
```
start-spark-streaming.sh

### EXAMPLE OUTPUT ###
...Starting Spark Streaming App...
...logs available with "tail -f $PIPELINE_HOME/logs/spark/streaming/ratings-kafka-cassandra.log"
```
- Verify that the Spark Streaming process is running using `jps -l`
```
jps -l

### EXAMPLE OUTPUT ###
25688 org.apache.spark.executor.CoarseGrainedExecutorBackend <-- Spark Executor JVM
25566 org.apache.spark.deploy.SparkSubmit <-- Long-running Spark Streaming App
```
- Monitor the live Spark Streaming log file
```
tail-spark-streaming.sh

### EXAMPLE OUTPUT ###
-------------------------------------------
Time: 1466368854000 ms
-------------------------------------------
...
```
- Hit `Ctrl-C` to exit
- Navigate your browser to the Demo Home Page at http://<your-cloud-ip>
- Follow the steps detailed on the Demo Home Page
- Keep an eye on the Spark Streaming Application log file from the previous step
- You should see ratings flowing through the Spark Streaming Application log file
- Click on the navigation links at the top to familiarize yourself with the tools of the environment
- If you can't reach the Demo Home Page, your services are either not started or you have not configured your cloud instance firewall rules (GCE) or security groups (AWS) properly
- Check out the Troubleshooting Guide if you're having problems
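A quick way to tell a firewall problem from a service problem, run from your local machine (assuming `curl` is installed):

```
# Check whether port 80 on the cloud instance answers at all
curl -I --max-time 10 http://<your-cloud-ip>
# A timeout suggests firewall rules / security groups; any HTTP response
# (even an error status) means the port is open and a service is listening.
```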