Welcome! This course assumes you already have some programming experience. We're going to be using Python, but experience with any language should be fine.
Before we can start using all the cool tools like Jupyter, NumPy, and pandas, we need to make sure Python and pip are installed. These two are the foundation for everything else. Don't worry, it’s pretty straightforward, and I’ll guide you through the steps.
First things first, we need to install Python. Python is the programming language we’ll be using throughout the course.
- Head over to the official Python website.
- Download the latest version of Python for Windows (it’ll detect your operating system automatically).
- Important: During the installation, make sure to check the box that says Add Python to PATH. This will save you a lot of headaches later on.
- Follow the installation instructions, clicking through the prompts until it’s done.
If you’re on a Mac, you may already have Python installed, but it’s often an older version, so let’s update it.
- Again, head to the Python website and download the latest version for macOS.
- Run the installer and follow the instructions.
- That’s it – simple!
Most Linux distributions come with Python pre-installed, but to check, open your terminal and type:
python3 --version
If it shows a version number (e.g., Python 3.8.5), you’re good to go. If not, install it (on a Debian- or Ubuntu-based distribution) with:
sudo apt-get update
sudo apt-get install python3
pip is Python’s package manager, and it lets us install libraries like NumPy and pandas. It usually comes with Python, but we’ll double-check just to be sure.
Open your terminal (or command prompt on Windows) and type:
pip --version
If you see something like this, you’re all set:
pip 21.0.1 from /usr/local/lib/python3.9/site-packages (python 3.9)
If pip isn’t installed, here’s how to get it sorted.
- Windows/macOS: It should come with Python. If you’re missing it, try reinstalling Python (on Windows, make sure you check the Add Python to PATH box).
- Linux: Run the following command in your terminal:
sudo apt-get install python3-pip
Let’s make sure everything is working properly.
- Open your terminal or command prompt.
- Type the following to check your Python installation (on macOS or Linux, use python3 if the python command is not found):
python --version
You should see something like this:
Python 3.9.1
If you see something similar, that means Python is installed correctly.
- Now check if pip is working:
pip --version
You should get something like:
pip 21.0.1
You’re all set now. Python and pip are installed, and you're ready to dive into the lessons. If you run into any issues, check the official Python documentation for troubleshooting tips.
Now, whenever we mention running commands like pip install pandas, you’ll know exactly what to do.
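If you'd rather run the same checks from inside Python, here is a minimal sketch. The helper name check_environment and the minimum version (3, 8) are assumptions made for illustration, not course requirements.

```python
import importlib.util
import sys


def check_environment(min_version=(3, 8)):
    """Report the running interpreter version and whether pip can be imported.

    min_version is an illustrative threshold, not an official requirement.
    """
    return {
        "python_version": ".".join(str(part) for part in sys.version_info[:3]),
        "python_ok": sys.version_info[:2] >= min_version,
        "pip_available": importlib.util.find_spec("pip") is not None,
    }


report = check_environment()
print(report)
```

If pip_available comes back as False, the reinstall steps above are the place to start.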
- Overview of Jupyter Notebook
- Navigating the interface: Code cells, Markdown, and output
- Writing basic Python code in Jupyter
- What is NumPy and why is it important?
- Creating arrays: numpy.array()
- Basic operations on NumPy arrays (element-wise operations, reshaping)
- Hands-on exercise: Basic array manipulations
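As a small preview of the NumPy topics above, here is a sketch of array creation, element-wise arithmetic, and reshaping. The values are made up for illustration.

```python
import numpy as np

# Made-up sample values.
values = np.array([1, 2, 3, 4, 5, 6])

doubled = values * 2          # element-wise multiplication
shifted = values + 10         # element-wise addition
grid = values.reshape(2, 3)   # same data viewed as 2 rows x 3 columns

print(doubled)
print(shifted)
print(grid)
```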
- What is pandas and why is it used?
- Understanding Series and DataFrames
- Creating a pandas DataFrame from scratch
- Hands-on exercise: Creating a simple DataFrame and performing basic operations
- Accessing rows and columns using .loc[] and .iloc[]
- Boolean indexing and filtering
- Hands-on exercise: Filtering data from a DataFrame
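To give a feel for the DataFrame topics above, here is a minimal sketch that builds a DataFrame from scratch and then selects from it by label, by position, and with a boolean mask. The column names and rows are hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample data with explicit row labels.
people = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Cara"],
        "age": [34, 27, 45],
        "city": ["Lima", "Oslo", "Lima"],
    },
    index=["a", "b", "c"],
)

ben_age = people.loc["b", "age"]             # label-based: row "b", column "age"
first_name = people.iloc[0, 0]               # position-based: first row, first column
in_lima = people[people["city"] == "Lima"]   # boolean mask keeps matching rows

print(ben_age, first_name)
print(in_lima)
```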
- Why use NumPy arrays in data engineering tasks
- Using numpy.where() for conditional operations
- Hands-on exercise: Using NumPy to optimize operations in a pandas DataFrame
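A short sketch of numpy.where() applied to a DataFrame column, so the condition is evaluated for the whole column at once instead of row by row. The threshold of 100 and the labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical order amounts.
orders = pd.DataFrame({"amount": [20.0, 150.0, 75.0, 300.0]})

# np.where(condition, value_if_true, value_if_false) works on the full column.
orders["size"] = np.where(orders["amount"] >= 100, "large", "small")

print(orders)
```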
- Handling missing data (dropna(), fillna())
- Renaming columns, dropping columns, and replacing values
- Hands-on exercise: Cleaning a raw dataset
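The cleaning steps listed above can be sketched in one chained expression. The raw table, its column names, and the choice to fill missing amounts with 0.0 are all assumptions for illustration; the right fill value depends on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical messy input.
raw = pd.DataFrame(
    {
        "Cust Name": ["Ana", "Ben", None, "Dee"],
        "amt": [10.0, np.nan, 5.0, 8.0],
        "status": ["ok", "N/A", "ok", "ok"],
        "notes": ["", "", "", ""],
    }
)

cleaned = (
    raw.dropna(subset=["Cust Name"])                                 # drop rows without a customer
    .rename(columns={"Cust Name": "customer", "amt": "amount"})      # clearer column names
    .drop(columns=["notes"])                                         # remove an unused column
    .assign(
        amount=lambda df: df["amount"].fillna(0.0),                  # fill missing amounts
        status=lambda df: df["status"].replace("N/A", "unknown"),    # replace a placeholder value
    )
)

print(cleaned)
```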
- Using pd.to_datetime() for date conversion
- Extracting date components (year, month, etc.)
- Hands-on exercise: Parsing and analyzing time-series data
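Here is a minimal sketch of date parsing, assuming the dates arrive as ISO-formatted strings. The readings are made up.

```python
import pandas as pd

# Hypothetical readings with dates stored as text.
readings = pd.DataFrame(
    {
        "timestamp": ["2024-01-15", "2024-02-20", "2024-03-05"],
        "value": [1.5, 2.0, 2.5],
    }
)

# Convert text to real datetimes, then pull out components with the .dt accessor.
readings["timestamp"] = pd.to_datetime(readings["timestamp"])
readings["year"] = readings["timestamp"].dt.year
readings["month"] = readings["timestamp"].dt.month

print(readings)
```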
- Understanding the groupby() function
- Aggregating data using sum(), mean(), count()
- Hands-on exercise: Aggregating sales data by region or product
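A small sketch of grouping and aggregating, using invented sales figures in the spirit of the exercise above.

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame(
    {
        "region": ["North", "South", "North", "South", "North"],
        "revenue": [100, 200, 150, 50, 250],
    }
)

# One row per region, one column per aggregate.
summary = sales.groupby("region")["revenue"].agg(["sum", "mean", "count"])

print(summary)
```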
- Combining data with merge(), concat(), and join()
- Understanding different types of joins (inner, outer, left, right)
- Hands-on exercise: Joining two datasets (e.g., customer data and order data)
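To see how the join types differ, here is a sketch with two tiny made-up tables. Customer 4 exists only in the orders table, and customers 2 and 3 have no orders, so each join type returns a different number of rows.

```python
import pandas as pd

# Hypothetical customer and order tables.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Ben", "Cara"]})
orders = pd.DataFrame(
    {"order_id": [10, 11, 12], "customer_id": [1, 1, 4], "total": [20.0, 35.0, 15.0]}
)

inner = customers.merge(orders, on="customer_id", how="inner")  # only matching ids
left = customers.merge(orders, on="customer_id", how="left")    # every customer kept
outer = customers.merge(orders, on="customer_id", how="outer")  # everything from both sides

print(len(inner), len(left), len(outer))
```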
- Chunking data using chunksize in pandas
- Memory-efficient operations in pandas
- Hands-on exercise: Reading and processing large CSV files
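Here is a minimal sketch of chunked reading. To keep it self-contained, a small in-memory string stands in for a large CSV file; with a real file you would pass its path to pd.read_csv instead. The chunk size of 4 is chosen only to make the chunking visible.

```python
import io

import pandas as pd

# Stand-in for a large file: 10 made-up rows of id,amount.
csv_text = "id,amount\n" + "\n".join(f"{i},{i * 2}" for i in range(1, 11))

total = 0
chunks_seen = 0
# With chunksize set, read_csv yields DataFrames of at most 4 rows each,
# so only one chunk is held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["amount"].sum()
    chunks_seen += 1

print(total, chunks_seen)
```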
- Reading and writing data from/to CSV, Excel, JSON, SQL databases
- Hands-on exercise: Importing data from different sources and exporting the results
- Building a simple ETL pipeline with pandas
- Data extraction, transformation, and loading processes
- Hands-on exercise: Create a mini-ETL pipeline using real-world data
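One possible shape for a mini-ETL pipeline is three small functions, sketched below. The function names, the sample CSV, and the transformation rules are all hypothetical; in-memory buffers stand in for real source and target files.

```python
import io

import pandas as pd

# Made-up raw input; the "south" row has a missing amount.
RAW_CSV = "region,amount\nnorth,12.5\nsouth,\neast,3.0\n"


def extract(source):
    """Read raw CSV data from a path or file-like object."""
    return pd.read_csv(source)


def transform(frame):
    """Drop rows without an amount, normalize region names, add a cents column."""
    cleaned = frame.dropna(subset=["amount"]).copy()
    cleaned["region"] = cleaned["region"].str.upper()
    cleaned["amount_cents"] = (cleaned["amount"] * 100).round().astype(int)
    return cleaned


def load(frame, target):
    """Write the transformed data as CSV to a path or file-like object."""
    frame.to_csv(target, index=False)


target = io.StringIO()
result = transform(extract(io.StringIO(RAW_CSV)))
load(result, target)

print(target.getvalue())
```

Keeping each stage in its own function makes it easier to test and swap stages independently, for example replacing load with a database writer later on.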
- Identifying and handling duplicates
- Data validation and constraints
- Hands-on exercise: Detect and fix quality issues in a dataset
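A short sketch of two common quality checks: finding exact duplicate rows and flagging values outside an allowed range. The records and the 0 to 120 age range are assumptions for illustration.

```python
import pandas as pd

# Hypothetical records: row 3 repeats row 2, and the age -5 is clearly invalid.
records = pd.DataFrame(
    {
        "id": [1, 2, 2, 3],
        "email": ["a@x.com", "b@x.com", "b@x.com", "c@x.com"],
        "age": [30, -5, -5, 41],
    }
)

duplicate_mask = records.duplicated()    # True for repeats of an earlier row
deduped = records.drop_duplicates()      # keep the first occurrence only

# Simple validation rule: age must fall between 0 and 120 inclusive.
invalid_age = deduped[~deduped["age"].between(0, 120)]

print(duplicate_mask.tolist())
print(invalid_age)
```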
- What is vectorization and why is it important?
- Using NumPy’s broadcasting for efficient data manipulation
- Hands-on exercise: Implement vectorized operations on large datasets
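Broadcasting is easiest to see on a small example. In this sketch, a one-dimensional array of three made-up discounts is applied to every row of a 2 x 3 price table without writing a loop.

```python
import numpy as np

# Hypothetical prices: 2 stores (rows) x 3 products (columns).
prices = np.array([[10.0, 20.0, 30.0], [40.0, 50.0, 60.0]])

# One discount rate per product; shape (3,) is broadcast across both rows.
discounts = np.array([0.0, 0.25, 0.5])

discounted = prices * (1 - discounts)

print(discounted)
```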
- Using apply(), map(), and applymap() for transformation
- Lambda functions and custom transformations
- Hands-on exercise: Apply custom functions across DataFrame columns
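Here is a minimal sketch of map() on a single column and apply() with a lambda across rows. The staff table is invented. applymap() is left out of the sketch because recent pandas releases deprecate it in favor of DataFrame.map(), so which one you use depends on your installed version.

```python
import pandas as pd

# Hypothetical staff data.
staff = pd.DataFrame(
    {"name": ["ana", "ben"], "salary": [1000.0, 2000.0], "bonus": [100.0, 50.0]}
)

# Series.map applies a function to each value of one column.
staff["name"] = staff["name"].map(str.title)

# DataFrame.apply with axis=1 passes each row to the lambda.
staff["total"] = staff.apply(lambda row: row["salary"] + row["bonus"], axis=1)

print(staff)
```

For simple arithmetic like this, staff["salary"] + staff["bonus"] would be faster; apply() earns its place when the logic cannot be expressed as a vectorized operation.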
- Introduction to rolling windows and expanding windows
- Using window functions to calculate moving averages and other time-based operations
- Hands-on exercise: Implement rolling statistics on time-series data
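A small sketch of rolling and expanding windows on a made-up daily series. The window size of 3 is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical daily measurements.
series = pd.Series(
    [10.0, 20.0, 30.0, 40.0, 50.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Rolling: mean over the current and two previous days (NaN until 3 values exist).
moving_avg = series.rolling(window=3).mean()

# Expanding: mean over everything seen so far.
running_avg = series.expanding().mean()

print(moving_avg)
print(running_avg)
```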
- Optimizing pandas for large-scale data
- Introduction to Dask and PySpark as pandas alternatives for distributed environments
- Hands-on exercise: Explore Dask or PySpark for large-scale DataFrame operations
- Connecting pandas with SQL databases using SQLAlchemy
- Loading data from and into SQL databases
- Hands-on exercise: Extract data from a database, transform it, and load it back
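The extract, transform, load-back round trip can be sketched as follows. To stay self-contained, this uses Python's built-in sqlite3 module with an in-memory database instead of SQLAlchemy; with SQLAlchemy you would pass an engine to the same pandas calls. The table names, columns, and the 0.5 fee are hypothetical.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database as a stand-in for a real server.
connection = sqlite3.connect(":memory:")

# Seed a made-up source table.
source = pd.DataFrame({"product": ["a", "b", "c"], "price": [1.0, 2.5, 4.0]})
source.to_sql("products", connection, index=False)

# Extract with a query, transform in pandas, then load into a new table.
loaded = pd.read_sql("SELECT product, price FROM products WHERE price > 1.5", connection)
loaded["price_with_fee"] = loaded["price"] + 0.5
loaded.to_sql("products_priced", connection, index=False, if_exists="replace")

row_count = connection.execute("SELECT COUNT(*) FROM products_priced").fetchone()[0]
connection.close()

print(row_count)
```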
- Define the final project: Design and implement a data engineering pipeline that extracts, cleans, transforms, and loads a dataset into a database
- Use real-world data (e.g., weather data, sales data, or financial data)
- Guided implementation of the project
- Cover all steps: data loading, cleaning, transformation, aggregation, and final storage
- Present the project solution, walk through each component, and discuss challenges and solutions