View the Jupyter Notebooks below for the detailed PySpark implementation code.

| # | Notebook | Description |
|---|---|---|
| 1 | Insurance Price Prediction | Predicting the price of health insurance using Linear Regression |
| 2 | Insurance Risk Score Prediction | Predicting the insurance risk score (Low, Medium, High) using Random Forest |
| 3 | Insurance Fraud Detection | Detecting insurance fraud using Random Forest |
---
After cloning the repository, run the following command to create a virtual environment:

```bash
python -m venv .venv
```
---
Install the required packages:

```bash
pip install -r requirements.txt
```
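
As a quick sanity check after installation (assuming `pyspark` is included in `requirements.txt`), a minimal script like the following starts a local SparkSession and shows a tiny DataFrame; the app name and data are arbitrary:

```python
# Smoke test: start a local SparkSession and show a one-row DataFrame.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("smoke-test")   # arbitrary app name
    .master("local[*]")      # run locally using all available cores
    .getOrCreate()
)

df = spark.createDataFrame([(1, "ok")], ["id", "status"])
df.show()

spark.stop()
```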
Apache Spark:
- Fast and general-purpose cluster computing system
- Speed: runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
- General-purpose: combines SQL, streaming, and complex analytics
- Powerful caching
- Real-time stream processing
- High-level APIs in Java, Scala, Python, and R (see the example below)
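
To make the "general-purpose" point concrete, here is a small sketch (with made-up column names and toy data) that runs the same aggregation once through the DataFrame API and once through Spark SQL:

```python
# Sketch: the same toy data queried with the DataFrame API and with Spark SQL.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("api-overview").getOrCreate()

claims = spark.createDataFrame(
    [("north", 1200.0), ("south", 350.0), ("north", 980.0)],
    ["region", "claim_amount"],
)

# DataFrame API
claims.groupBy("region").agg(F.avg("claim_amount").alias("avg_claim")).show()

# Spark SQL over the same data
claims.createOrReplaceTempView("claims")
spark.sql(
    "SELECT region, AVG(claim_amount) AS avg_claim FROM claims GROUP BY region"
).show()

spark.stop()
```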
The main components of the Spark ecosystem:
- Engine:
  - Spark Core: The base engine for large-scale parallel and distributed data processing
- Management:
  - YARN: Resource management
  - Mesos: Cluster management
- Libraries:
  - Spark SQL: SQL and structured data processing
  - MLlib: Machine learning (a minimal example follows this list)
  - GraphX: Graph processing
  - Spark Streaming: Real-time data processing
- Programming:
  - Scala, Java, Python, R
- Storage:
  - HDFS, local file system, RDBMS, NoSQL, Amazon S3, etc.
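
Since the notebooks above rely on MLlib, here is a minimal, self-contained sketch of an MLlib workflow; the toy data, column names, and model choice (Linear Regression, as in the price-prediction notebook) are illustrative only:

```python
# Minimal MLlib sketch: assemble feature columns into a vector and fit a
# linear regression model on hypothetical toy data.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

data = spark.createDataFrame(
    [(23, 1, 2500.0), (45, 3, 6200.0), (31, 0, 3100.0)],
    ["age", "children", "charges"],
)

# Combine the input columns into a single features vector.
assembler = VectorAssembler(inputCols=["age", "children"], outputCol="features")
train = assembler.transform(data)

# Fit the model and compare predictions against the label column.
lr = LinearRegression(featuresCol="features", labelCol="charges")
model = lr.fit(train)
model.transform(train).select("charges", "prediction").show()

spark.stop()
```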
RDD (Resilient Distributed Dataset):
- Fault-tolerant collection of elements that can be operated on in parallel
- Immutable distributed collection of objects (a minimal example follows below)
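
A minimal RDD sketch (arbitrary numbers, local mode) showing both properties: elements are processed in parallel, and transformations return new RDDs rather than modifying the original:

```python
# RDD sketch: parallelize a local list, transform it, and reduce it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4, 5])   # distributed collection
squares = numbers.map(lambda x: x * x)      # new RDD; `numbers` is unchanged
total = squares.reduce(lambda a, b: a + b)  # action: 1 + 4 + 9 + 16 + 25 = 55
print(total)

spark.stop()
```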