View the Jupyter Notebooks below for the detailed PySpark implementation code.

| # | Notebook | Description |
|---|---|---|
| 1 | Insurance Price Prediction | Predicting the price of health insurance using Linear Regression |
| 2 | Insurance Risk Score Prediction | Predicting the insurance risk score (Low, Medium, High) using Random Forest |
| 3 | Insurance Fraud Detection | Detecting insurance fraud using Random Forest |
---
After cloning the repository, run the following command to create a virtual environment:

```bash
python -m venv .venv
```
---
Install the required packages:

```bash
pip install -r requirements.txt
```
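
As a quick sanity check after installation (assuming `pyspark` is included in `requirements.txt`), a minimal script like the following starts a local SparkSession and shows a tiny DataFrame; the app name and data are arbitrary:

```python
# Smoke test: start a local SparkSession and show a one-row DataFrame.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("smoke-test")   # arbitrary app name
    .master("local[*]")      # run locally using all available cores
    .getOrCreate()
)

df = spark.createDataFrame([(1, "ok")], ["id", "status"])
df.show()

spark.stop()
```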
Apache Spark:
- Fast and general-purpose cluster computing system
- Speed: runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
- General-purpose: combines SQL, streaming, and complex analytics
- Powerful caching
- Real-time stream processing
- High-level APIs in Java, Scala, Python, and R (see the example below)
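
To make the "general-purpose" point concrete, here is a small sketch (with made-up column names and toy data) that runs the same aggregation once through the DataFrame API and once through Spark SQL:

```python
# Sketch: the same toy data queried with the DataFrame API and with Spark SQL.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("api-overview").getOrCreate()

claims = spark.createDataFrame(
    [("north", 1200.0), ("south", 350.0), ("north", 980.0)],
    ["region", "claim_amount"],
)

# DataFrame API
claims.groupBy("region").agg(F.avg("claim_amount").alias("avg_claim")).show()

# Spark SQL over the same data
claims.createOrReplaceTempView("claims")
spark.sql(
    "SELECT region, AVG(claim_amount) AS avg_claim FROM claims GROUP BY region"
).show()

spark.stop()
```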
The main components of the Spark ecosystem:
- Engine:
  - Spark Core: The base engine for large-scale parallel and distributed data processing
- Management:
  - YARN: Resource management
  - Mesos: Cluster management
- Libraries:
  - Spark SQL: SQL and structured data processing
  - MLlib: Machine learning (a minimal example follows this list)
  - GraphX: Graph processing
  - Spark Streaming: Real-time data processing
- Programming:
  - Scala, Java, Python, R
- Storage:
  - HDFS, local file system, RDBMS, NoSQL, Amazon S3, etc.
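
Since the notebooks above rely on MLlib, here is a minimal, self-contained sketch of an MLlib workflow; the toy data, column names, and model choice (Linear Regression, as in the price-prediction notebook) are illustrative only:

```python
# Minimal MLlib sketch: assemble feature columns into a vector and fit a
# linear regression model on hypothetical toy data.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

data = spark.createDataFrame(
    [(23, 1, 2500.0), (45, 3, 6200.0), (31, 0, 3100.0)],
    ["age", "children", "charges"],
)

# Combine the input columns into a single features vector.
assembler = VectorAssembler(inputCols=["age", "children"], outputCol="features")
train = assembler.transform(data)

# Fit the model and compare predictions against the label column.
lr = LinearRegression(featuresCol="features", labelCol="charges")
model = lr.fit(train)
model.transform(train).select("charges", "prediction").show()

spark.stop()
```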
RDD (Resilient Distributed Dataset):
- Fault-tolerant collection of elements that can be operated on in parallel
- Immutable distributed collection of objects (a minimal example follows below)
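
A minimal RDD sketch (arbitrary numbers, local mode) showing both properties: elements are processed in parallel, and transformations return new RDDs rather than modifying the original:

```python
# RDD sketch: parallelize a local list, transform it, and reduce it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4, 5])   # distributed collection
squares = numbers.map(lambda x: x * x)      # new RDD; `numbers` is unchanged
total = squares.reduce(lambda a, b: a + b)  # action: 1 + 4 + 9 + 16 + 25 = 55
print(total)

spark.stop()
```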