This project focuses on cleaning a dataset of global layoffs from major companies (2020-2023). Using SQL, I ensured the data is clean, standardized, and ready for analysis.
Check out the SQL queries used in this project: SQL Cleaning Scripts
Handling real-world data often involves dealing with duplicates, inconsistencies, and missing values. This project applies a structured SQL-based cleaning process to prepare the layoffs dataset for meaningful analysis.
- Used `ROW_NUMBER()` with `PARTITION BY` to identify duplicate records
- Created a staging table (`layoffs_staging2`) to store deduplicated data
- Ensured only unique entries were retained
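The deduplication step above can be sketched roughly as follows. This is a minimal illustration, not the project's exact script: the column names (`company`, `location`, `industry`, `total_laid_off`, `date`) and the source table name `layoffs_staging` are assumptions about the dataset's schema.

```sql
-- Build the staging table with a row number per duplicate group
-- (partition columns are illustrative; the real script may use more or fewer)
CREATE TABLE layoffs_staging2 AS
SELECT *,
       ROW_NUMBER() OVER (
           PARTITION BY company, location, industry, total_laid_off, `date`
           ORDER BY company
       ) AS row_num
FROM layoffs_staging;

-- Keep only the first occurrence of each duplicate group
DELETE FROM layoffs_staging2
WHERE row_num > 1;
```

Deleting from a staging table rather than the raw import means the original data stays untouched if a cleaning step needs to be redone.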
- Fixed spelling inconsistencies in company names, locations, and industries
- Standardized text formats
- Replaced missing values where necessary
- Removed rows where crucial fields were missing
- Removed incomplete and erroneous records
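The standardization and null-handling steps might look like the sketch below. The specific fixes shown (trimming whitespace, merging variant industry labels, dropping rows with no layoff figures) are hypothetical examples of the kinds of cleanup described above, and the column names are assumed.

```sql
-- Trim stray whitespace from company names
UPDATE layoffs_staging2
SET company = TRIM(company);

-- Merge variant spellings under one standard label
-- (the 'Crypto%' pattern is an illustrative example)
UPDATE layoffs_staging2
SET industry = 'Crypto'
WHERE industry LIKE 'Crypto%';

-- Drop rows missing the figures the analysis depends on
DELETE FROM layoffs_staging2
WHERE total_laid_off IS NULL
  AND percentage_laid_off IS NULL;
```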
- MySQL Workbench for SQL queries and data cleaning
- Kaggle for dataset sourcing
- The importance of data cleaning in ensuring accuracy and reliability for analysis
- How to use SQL functions like `ROW_NUMBER()` to detect duplicates efficiently
- Best practices for handling missing data and maintaining data integrity
- The significance of standardizing values to avoid inconsistencies
Data cleaning is a crucial step in any data analysis project. By applying structured SQL techniques, I transformed raw, messy data into a reliable dataset suitable for insights and decision-making.
This project reinforced the importance of systematic data cleaning, preparing me for future data-driven projects. Next steps include automating the cleaning process and integrating it with visualization tools for deeper analysis.