Hardvan/Learning-PySpark

Learning PySpark

Jupyter Notebook

View the Jupyter Notebook for the detailed PySpark implementation code.

Insurance Notebooks

| # | Notebook | Description |
| --- | --- | --- |
| 1 | Insurance Price Prediction | Predicting the price of health insurance using Linear Regression |
| 2 | Insurance Risk Score Prediction | Predicting the risk score (Low, Medium, High) of insurance using Random Forest |
| 3 | Insurance Fraud Detection | Detecting fraud in insurance using Random Forest |

Usage

  1. After cloning the repository, run the following command to create a virtual environment:

    python -m venv .venv
  2. Activate the virtual environment (source .venv/bin/activate on Linux/macOS, .venv\Scripts\activate on Windows), then install the required packages:

    pip install -r requirements.txt

What is Apache Spark?

  • Fast, general-purpose cluster computing system
  • Speed: Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
  • General-purpose: Combines SQL, streaming, and complex analytics
  • Powerful caching
  • Real-time stream processing
  • Provides high-level APIs in Java, Scala, Python, and R

Spark Ecosystem

  • Engine:
    • Spark Core: The base engine for large-scale parallel and distributed data processing
  • Management:
    • YARN: Resource management
    • Mesos: Cluster management
  • Libraries:
    • Spark SQL: SQL and structured data processing
    • MLlib: Machine learning
    • GraphX: Graph processing
    • Spark Streaming: Real-time data processing
  • Programming:
    • Scala, Java, Python, R
  • Storage:
    • HDFS, Local FS (file system), RDBMS, NoSQL, Amazon S3 etc.

RDD (Resilient Distributed Dataset)

  • Fault-tolerant collection of elements that can be operated on in parallel
  • Immutable distributed collection of objects
