This Data Cleaning & Preprocessing Toolkit is a Python library that makes the data-wrangling phase of your data science projects as smooth as butter. It is not meant to replace your favorite libraries (pandas, nltk); it extends them with a few functions that I found myself writing over and over. The focus is on code reusability and ease of use.
"In God we trust. All others must bring data." - W. Edwards Deming (Ostensibly).
- Missing Value Handling: Say goodbye to missing data woes.
- Outlier Management: Handle outliers like a pro.
- Feature Engineering: Transform raw data into insights.
- Text Cleaning: Get your text data in shape.
- Data Integrity Checks: Ensure your data is clean and reliable.
- Class Imbalance Handling: Balance your datasets like a yogi.
- Open your terminal and run:
    python3 -m venv SDPython
  This will create a new Python virtual environment named SDPython.
- Activate the virtual environment:
  - macOS and Linux:
      source SDPython/bin/activate
  - Windows:
      .\SDPython\Scripts\Activate
- After activating the virtual environment, navigate to the directory where requirements.txt is located and run:
    pip install -r requirements.txt

- missing_values.py: Impute and manage missing values.
- outliers.py: Detect and handle outliers effectively.
- feature_engineering.py: Tools for feature creation and transformation.
- text_cleaning.py: Essential text cleaning operations.
- integrity_checks.py: Functions for checking data integrity.
- imbalance_handling.py: Address class imbalance problems.
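
To give a feel for the kind of work the first two modules are aimed at, here is a minimal sketch written directly against pandas. It is not the toolkit's own API: the function names `impute_median` and `clip_iqr`, the `age` column, and the choice of median imputation and 1.5 × IQR clipping are all assumptions made for illustration.

```python
import pandas as pd


def impute_median(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of df with NaNs in `column` replaced by the column median.

    Hypothetical helper for illustration; not a function from this toolkit.
    """
    out = df.copy()
    out[column] = out[column].fillna(out[column].median())
    return out


def clip_iqr(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return a copy of df with `column` clipped to [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper for illustration; not a function from this toolkit.
    """
    out = df.copy()
    q1 = out[column].quantile(0.25)
    q3 = out[column].quantile(0.75)
    iqr = q3 - q1
    out[column] = out[column].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return out


# Toy data: one missing value and one obvious outlier.
df = pd.DataFrame({"age": [25.0, 30.0, None, 35.0, 40.0, 200.0]})

# Impute first, then clip, so the fences are computed on a complete column.
cleaned = clip_iqr(impute_median(df, "age"), "age")
print(cleaned)
```

Both helpers return a copy rather than modifying the input, so the original frame stays available for comparison. In this toy frame the missing entry is filled with the median of the observed values, and the 200 entry is pulled down to the upper fence.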
Check out the examples/ folder for usage examples!
Visit the examples/ folder for Jupyter Notebooks demonstrating how to use each module.
🔍 Each example walks you through the process, explaining every step.
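
The notebooks are the reference for how the toolkit itself is called. As a standalone illustration of a typical text-cleaning step, here is a small sketch that uses only the Python standard library. The function name `basic_clean` and the specific steps (lowercasing, stripping HTML-like tags, dropping punctuation, collapsing whitespace) are assumptions, not the contents of text_cleaning.py.

```python
import re
import string


def basic_clean(text: str) -> str:
    """Lowercase, strip HTML-like tags, drop ASCII punctuation, collapse whitespace.

    Hypothetical helper for illustration; not a function from this toolkit.
    """
    text = text.lower()
    # Replace anything that looks like <tag> with a space.
    text = re.sub(r"<[^>]+>", " ", text)
    # Remove ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


sample = "  <p>Hello,   World!</p>  Cleaning text is FUN. "
print(basic_clean(sample))  # hello world cleaning text is fun
```

The order matters: tags are removed before punctuation so that the angle brackets are still present for the tag pattern to match.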
We love contributions! Please see CONTRIBUTING.md for details on how you can contribute.
This project is licensed under the MIT License - see the LICENSE.md file for details.
- Website - jasongodfrey.info
- Email - jason.godfrey@accelerate.com
If you like this project, please give it a ⭐ on GitHub! 😊
