Ultimate Data Duplication & Similarity Analyzer — built for developers, analysts, and data engineers who need quick insights from messy, large, or structured data files.
Created by mauzware
Works on Linux 🐧 and Windows 🧩
Fast, powerful, and customizable via CLI ⚙️
- 🔁 Detects exact duplicates across rows and columns
- 🔍 Finds similar values using fuzzy matching
- 🧠 Automatically categorizes values: Numeric, Textual, Mixed, Unknown
- 📊 Outputs detailed summaries and top frequent values
- 💾 Exports reports in JSON, TXT, or XML
- 📁 Supports both CSV and XLSX files
- ⚡ Modes: Fast, Standard, and Detailed
- 🧼 Experimental support for messy CSVs (--messy)
- 🌈 Colorful CLI with optional logging and quiet/debug modes
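
To make the feature list more concrete, here is a minimal, standard-library-only sketch of value categorization, exact-duplicate counting, and fuzzy matching. It is not Mauzalyzer's actual code: the function names (`categorize`, `exact_duplicates`, `similar_pairs`), the use of `difflib`, and the categorization rules are all assumptions made for illustration. The threshold of 85 mirrors the `similarity_threshold=85` default visible in `scan_csv()`.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def categorize(value):
    """Label a raw cell as Numeric, Textual, Mixed, or Unknown (illustrative rules)."""
    text = "" if value is None else str(value).strip()
    if not text:
        return "Unknown"
    try:
        float(text)  # anything float() accepts counts as numeric here
        return "Numeric"
    except ValueError:
        pass
    has_digit = any(ch.isdigit() for ch in text)
    has_alpha = any(ch.isalpha() for ch in text)
    if has_digit and has_alpha:
        return "Mixed"
    if has_alpha:
        return "Textual"
    return "Unknown"


def exact_duplicates(values):
    """Return each value that occurs more than once, with its count."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n > 1}


def similar_pairs(values, threshold=85):
    """Return (a, b, score) for distinct string values scoring >= threshold (0-100)."""
    unique = sorted(set(values))
    pairs = []
    for a, b in combinations(unique, 2):
        # Case-insensitive similarity ratio, scaled to a 0-100 score.
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100
        if score >= threshold:
            pairs.append((a, b, round(score, 1)))
    return pairs
```

Comparing every pair is quadratic in the number of unique values, which is one plausible reason a tool would offer separate fast and detailed modes.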
Make sure you have Python 3.9+ installed (tested on 3.13+).
Use either pip or pip3, whichever one your Python installation provides.
Windows
git clone https://github.com/mauzware/Mauzalyzer.git
cd Mauzalyzer
pip install -r requirements.txt
python mauzalyzer.py --help
Linux Debian/Ubuntu
git clone https://github.com/mauzware/Mauzalyzer.git
cd Mauzalyzer
pip install -r requirements.txt
python mauzalyzer.py --help
Kali Linux
In Kali, all required modules are already pre-installed.
git clone https://github.com/mauzware/Mauzalyzer.git
cd Mauzalyzer
python mauzalyzer.py --help
If any modules are missing, you can install them in one of two ways:
- Create a virtual environment and run: pip3 install -r requirements.txt
- Install them manually with apt: sudo apt install python3-[module_name]
Virtual Environment Setup
sudo apt update
sudo apt install python3-venv -y
git clone https://github.com/mauzware/Mauzalyzer.git
cd Mauzalyzer
python3 -m venv Mauzalyzer-env
source Mauzalyzer-env/bin/activate
pip install -r requirements.txt
deactivate
Use either python or python3, whichever one works on your system.
python mauzalyzer.py [OPTIONS] source
python3 mauzalyzer.py [OPTIONS] source
Examples:
python3 mauzalyzer.py Your_File.csv #Basic scan
python mauzalyzer.py Your_File.xlsx #Basic scan
python3 mauzalyzer.py Your_File.csv --detailed #Detailed scan
python mauzalyzer.py --fast Your_File.xlsx #Fast scan
python3 mauzalyzer.py Your_File.csv -o Report_Name --output-format=txt #Saving output in TXT format
python mauzalyzer.py Your_File.xlsx -o Report_Name --output-format=xml #Saving output in XML format

| Option | Description |
|---|---|
| --fast | Fast scanning (basic checks only) |
| --detailed | Detailed scanning with deep similarity analysis |
| --type csv/xlsx | Manually set the file type |
| --chunksize | Set custom chunk size for large files |
| --messy | Preprocess messy CSV files |
| -o, --output | Custom output file name |
| --output-format | Output format: json, txt, xml |

| Flag | Description |
|---|---|
| --version | Display version and author info |
| --help | Show help screen |
| --debug | Enable full debug traceback |
| --quiet | Suppress all output |
| -v, --verbose | Show verbose output |
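
How Mauzalyzer applies --chunksize internally is not described here. The sketch below only illustrates the general idea behind chunked processing, using the standard library: rows are read in fixed-size batches so the whole file never has to sit in memory at once. The helper names (`iter_chunks`, `count_duplicate_rows`) and the default of 1000 rows are hypothetical.

```python
import csv
from collections import Counter
from itertools import islice


def iter_chunks(reader, chunksize):
    """Yield lists of at most `chunksize` rows from a csv reader."""
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return
        yield chunk


def count_duplicate_rows(handle, chunksize=1000):
    """Count rows that appear more than once, reading the input chunk by chunk."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row, if any
    counts = Counter()
    for chunk in iter_chunks(reader, chunksize):
        # Rows are converted to tuples so they can be used as Counter keys.
        counts.update(tuple(row) for row in chunk)
    return {row: n for row, n in counts.items() if n > 1}
```

A call such as `count_duplicate_rows(open("Your_File.csv", newline=""), chunksize=5000)` would process 5000 rows at a time. Note that the counter itself still grows with the number of distinct rows, so chunking bounds the read buffer, not the total memory used.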
💡 Help menu on Linux:
💡 Help menu on Windows:
💡 Mauzalyzer in action:
{
  "analysis_date": "2025-04-17T17:03:33",
  "data_source": "Your_Input_File.xlsx",
  "findings": [...],
  "summary": {...}
}
Reports are saved to the data_report/ folder, and each file name includes a timestamp and a hash for uniqueness.
The data_report/ folder is created automatically on first use.
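
As a rough illustration of the naming scheme described above, the sketch below builds a report path from a timestamp plus a short hash and writes a JSON report with the same top-level keys as the sample. The exact file name layout, the choice of SHA-256, the 8-character digest, and the function names (`build_report_path`, `save_report`) are assumptions, not Mauzalyzer's real implementation.

```python
import hashlib
import json
from datetime import datetime
from pathlib import Path


def build_report_path(source, name="report", fmt="json", now=None, folder="data_report"):
    """Return a report path whose name carries a timestamp and a short hash."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    # Hash the source name together with the timestamp to get a short unique suffix.
    digest = hashlib.sha256(f"{source}{now.isoformat()}".encode("utf-8")).hexdigest()[:8]
    return Path(folder) / f"{name}_{stamp}_{digest}.{fmt}"


def save_report(source, findings, summary, name="report", now=None, folder="data_report"):
    """Write a JSON report and return the path it was saved to."""
    now = now or datetime.now()
    path = build_report_path(source, name, "json", now, folder)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the folder on first use
    report = {
        "analysis_date": now.isoformat(timespec="seconds"),
        "data_source": source,
        "findings": findings,
        "summary": summary,
    }
    path.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return path
```

Passing `now` explicitly keeps the file name and the `analysis_date` field in sync and makes the function easy to test.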
This code helps remove repeated or stray header rows inside messy CSVs (usually when a report was exported from Excel or multiple tables were merged).
❗️ When to use:
- You scanned a file and noticed weird duplicated values like "type" or "sale_date"
- You know your file includes repeated headers (you may have seen them when you opened the file in Excel)
📋 Instructions:
- Below this comment, you'll see a function called 'remove_headers(df)'. Edit its 'known_headers' list to include any words you want treated as header rows.
- You can add or remove values in 'known_headers' as needed; it's entirely up to you.
- In the 'scan_csv()' method, find the line '#df = remove_headers(df)' and delete the leading #. That's it: header removal is now enabled.
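
For readers who want to see the idea before touching the code, here is a dependency-free sketch of header-row removal. It works on plain lists of rows rather than a pandas DataFrame, and the function name `remove_header_rows` and its matching rule (drop any row in which a cell equals a known header word) are assumptions. The real `remove_headers(df)` may behave differently.

```python
def remove_header_rows(rows, known_headers):
    """Drop repeated header rows from `rows`; the first row is kept as the real header."""
    if not rows:
        return []
    # Normalize the header words once so matching ignores case and padding.
    known = {h.strip().lower() for h in known_headers}
    cleaned = [rows[0]]
    for row in rows[1:]:
        cells = [str(cell).strip().lower() for cell in row]
        if any(cell in known for cell in cells):
            continue  # looks like a stray header row, so skip it
        cleaned.append(row)
    return cleaned
```

With `known_headers = ["type", "sale_date"]`, a second `["type", "sale_date"]` row in the middle of the data is dropped while ordinary rows pass through. Matching on any single cell is deliberately aggressive; if a header word can also be a legitimate data value, a stricter rule (for example, requiring every cell to match) would be safer.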
Example:
def scan_csv(self, similarity_threshold=85):
    try:
        df = self.safe_read_csv(file_path)
        #df = remove_headers(df)  <-- Here, simply delete the # and it's done

Mauzalyzer Engineers are already cooking up new features for v2.0. Stay tuned! 👾
- 👁️ Better schema detection for extremely messy files
- 🗂️ Support for more formats: JSON, XML, TXT (as inputs)
- 🎛️ GUI mode (TBD)
- 🔧 Interactive mode for manual value inspection
- ⚙️ Additional CLI support
This project is open-source and distributed under the terms of the MIT License. You are free to use, modify, and distribute it with proper attribution.
All kudos go to my professor, who taught me everything I know. I think she will be proud of me for using this many emojis. 😅
To all my friends who supported me on this wonderful journey — I haven't forgotten you, folks. Big thanks and much love to all of you! ❤️





