🧪 Mauzalyzer

Ultimate Data Duplication & Similarity Analyzer — built for developers, analysts, and data engineers who need quick insights from messy, large, or structured data files.

Created by mauzware
Works on Linux 🐧 and Windows 🧩
Fast, powerful, and customizable via CLI ⚙️


📦 Features

  • 🔁 Detects exact duplicates across rows and columns
  • 🔍 Finds similar values using fuzzy matching
  • 🧠 Automatically categorizes values: Numeric, Textual, Mixed, Unknown
  • 📊 Outputs detailed summaries and top frequent values
  • 💾 Exports reports in JSON, TXT, or XML
  • 📁 Supports both CSV and XLSX files
  • ⚡ Modes: Fast, Standard, and Detailed
  • 🧼 Experimental support for messy CSVs (--messy)
  • 🌈 Colorful CLI with optional logging and quiet/debug modes
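To make the categorization and fuzzy-matching features above more concrete, here is a minimal sketch using only the Python standard library. It is not Mauzalyzer's actual implementation: the function names `categorize` and `similar_pairs`, the category rules, and the use of `difflib` are all assumptions chosen for illustration. The default threshold of 85 mirrors the `similarity_threshold=85` default that appears in the `scan_csv` excerpt later in this README.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

# Assumed rule: an optional sign followed by an integer or decimal number.
NUMBER = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")


def categorize(value):
    """Label a raw cell value as Numeric, Textual, Mixed, or Unknown (hypothetical rules)."""
    text = "" if value is None else str(value).strip()
    if not text:
        return "Unknown"
    if NUMBER.fullmatch(text):
        return "Numeric"
    has_digit = any(ch.isdigit() for ch in text)
    has_alpha = any(ch.isalpha() for ch in text)
    if has_digit and has_alpha:
        return "Mixed"
    if has_alpha:
        return "Textual"
    return "Unknown"


def similar_pairs(values, threshold=85):
    """Return (a, b, score) for distinct values whose similarity score (0-100) meets the threshold."""
    unique = sorted(set(values))
    pairs = []
    for a, b in combinations(unique, 2):
        score = round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs


if __name__ == "__main__":
    print([categorize(v) for v in ["42", "hello", "abc123", ""]])
    print(similar_pairs(["Jonathan", "Jonathon", "Zebra"]))
```

Comparing every pair of distinct values is quadratic in the number of unique values, which is one plausible reason a tool would offer a separate fast mode that skips similarity analysis.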

🛠️ Installation

Make sure you have Python 3.9+ installed (tested on 3.13+).
Use either pip or pip3, depending on how Python is installed on your system.

Windows

git clone https://github.com/mauzware/Mauzalyzer.git
cd Mauzalyzer
pip install -r requirements.txt
python mauzalyzer.py --help

Linux Debian/Ubuntu

git clone https://github.com/mauzware/Mauzalyzer.git
cd Mauzalyzer
pip install -r requirements.txt
python mauzalyzer.py --help

Kali Linux

On Kali, all required modules come pre-installed.

git clone https://github.com/mauzware/Mauzalyzer.git
cd Mauzalyzer
python mauzalyzer.py --help

If any modules are missing, install them in one of two ways:

  1. Create a virtual environment (see Virtual Environment Setup) and run: pip3 install -r requirements.txt
  2. Install them individually with apt: sudo apt install python3-[module_name]

Virtual Environment Setup

sudo apt update
sudo apt install python3-venv -y

git clone https://github.com/mauzware/Mauzalyzer.git
cd Mauzalyzer
python3 -m venv Mauzalyzer-env
source Mauzalyzer-env/bin/activate
pip install -r requirements.txt
python mauzalyzer.py --help
deactivate #Run this only when you are done using Mauzalyzer

🖥️ Usage

Use either python or python3, depending on how Python is installed on your system.

python mauzalyzer.py [OPTIONS] source
python3 mauzalyzer.py [OPTIONS] source

Examples:

python3 mauzalyzer.py Your_File.csv #Basic scan
python mauzalyzer.py Your_File.xlsx #Basic scan
python3 mauzalyzer.py Your_File.csv --detailed #Detailed scan
python mauzalyzer.py --fast Your_File.xlsx #Fast scan
python3 mauzalyzer.py Your_File.csv -o Report_Name --output-format=txt #Saving output in TXT format
python mauzalyzer.py Your_File.xlsx -o Report_Name --output-format=xml #Saving output in XML format

🔧 Basic Options

Option             Description
--fast             Fast scanning (basic checks only)
--detailed         Detailed scanning with deep similarity analysis
--type csv/xlsx    Manually set the file type
--chunksize        Set custom chunk size for large files
--messy            Preprocess messy CSV files
-o, --output       Custom output file name
--output-format    Output format: json, txt, xml
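As an illustration of why a chunk size matters for large files, the sketch below counts exact duplicate rows while pulling only a fixed number of rows from the file at a time. It is an assumption-laden example, not Mauzalyzer's code: the function name `count_duplicate_rows` is hypothetical, and it uses the standard-library `csv` module so that it runs without extra dependencies.

```python
import csv
import io
from collections import Counter
from itertools import islice


def count_duplicate_rows(handle, chunksize=1000):
    """Count repeated data rows, reading `chunksize` rows per step (hypothetical helper)."""
    reader = csv.reader(handle)
    next(reader, None)  # assumption: the first row is the header, so skip it
    seen = Counter()
    while True:
        chunk = list(islice(reader, chunksize))  # at most `chunksize` rows in this step
        if not chunk:
            break
        seen.update(tuple(row) for row in chunk)
    # Keep only the rows that occurred more than once.
    return {row: count for row, count in seen.items() if count > 1}


if __name__ == "__main__":
    sample = io.StringIO("id,name\n1,a\n2,b\n1,a\n3,c\n1,a\n")
    print(count_duplicate_rows(sample, chunksize=2))
```

Note that the counter still grows with the number of distinct rows. Chunking limits how much raw input is parsed per step, not the size of the final tally.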

🛡️ Utility Flags

Flag             Description
--version        Display version and author info
--help           Show help screen
--debug          Enable full debug traceback
--quiet          Suppress all output
-v, --verbose    Show verbose output

📸 Screenshots

[Screenshots: help menu on Linux, help menu on Windows, and Mauzalyzer in action on Linux and Windows]


📂 Output Example

{
  "analysis_date": "2025-04-17T17:03:33",
  "data_source": "Your_Input_File.xlsx",
  "findings": [...],
  "summary": {...}
}

Reports are saved to the data_report/ folder and include a timestamp and a hash for uniqueness.
The data_report/ folder is created automatically the first time you run Mauzalyzer.
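A JSON report with the top-level keys shown above can be read back with the standard library. The sketch below is illustrative only: `summarize_report` is a hypothetical helper, and it touches just the four top-level keys from the example, because the inner structure of `findings` and `summary` is not documented here.

```python
import json
from datetime import datetime


def summarize_report(report_text):
    """Extract a few top-level facts from a JSON report string (hypothetical helper)."""
    report = json.loads(report_text)
    analyzed = datetime.fromisoformat(report["analysis_date"])
    return {
        "source": report["data_source"],
        "analyzed_on": analyzed.date().isoformat(),
        "finding_count": len(report["findings"]),  # assumes findings is a list
        "has_summary": bool(report.get("summary")),
    }


if __name__ == "__main__":
    # Placeholder content: the empty findings and summary are stand-ins, not real output.
    sample = json.dumps({
        "analysis_date": "2025-04-17T17:03:33",
        "data_source": "Your_Input_File.xlsx",
        "findings": [],
        "summary": {},
    })
    print(summarize_report(sample))
```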


Bonus: Optional Header Row Removal

This code helps remove repeated or stray header rows inside messy CSVs (usually when a report was exported from Excel or multiple tables were merged).

❗️ When to use:

  • You scanned a file and noticed odd duplicated values such as "type" or "sale_date"
  • You know your file includes repeated headers (for example, you saw them when you opened the file in Excel)

📋 Instructions:

  1. In the source code, find the function 'remove_headers(df)' and edit its 'known_headers' list to include any words you want to treat as "header rows".
  2. You can add or remove values in 'known_headers' as needed.
  3. In the method 'scan_csv()', find the line '#df = remove_headers(df)' and delete the leading # to enable header removal.

Example:

def scan_csv(self, similarity_threshold=85):
    try:
        df = self.safe_read_csv(file_path)
        #df = remove_headers(df)  # <-- Delete the leading # on this line to enable header removal
        # ... the rest of scan_csv() stays unchanged
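For readers curious about what header-row removal does conceptually, here is a minimal standard-library sketch. It is not the project's `remove_headers(df)` function, which works on a DataFrame. The name `drop_header_rows`, the `KNOWN_HEADERS` values (taken from the examples above), and the rule "drop a row if any cell equals a known header word" are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical header words, mirroring the examples mentioned in this README.
KNOWN_HEADERS = frozenset({"type", "sale_date"})


def drop_header_rows(rows, known_headers=KNOWN_HEADERS):
    """Keep only the rows in which no cell equals one of the known header words."""
    wanted = {header.lower() for header in known_headers}
    return [
        row for row in rows
        if not any(str(cell).strip().lower() in wanted for cell in row.values())
    ]


if __name__ == "__main__":
    messy = io.StringIO(
        "type,sale_date\n"
        "A,2024-01-01\n"
        "type,sale_date\n"  # stray header repeated in the middle of the data
        "B,2024-01-02\n"
    )
    cleaned = drop_header_rows(list(csv.DictReader(messy)))
    print(cleaned)
```

The comparison is case-insensitive and ignores surrounding whitespace, which is a design choice of this sketch rather than documented behavior of the tool.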

🚧 Future Plans: Mauzalyzer v2.0 (coming soon...)

Mauzalyzer Engineers are already cooking up new features for v2.0. Stay tuned! 👾

  • 👁️ Better schema detection for extremely messy files

  • 🗂️ Support for more formats: JSON, XML, TXT (as inputs)

  • 🎛️ GUI mode (TBD)

  • 🔧 Interactive mode for manual value inspection

  • ⚙️ Additional CLI support


👨‍💻 Author


📜 License

This project is open-source and distributed under the terms of the MIT License. You are free to use, modify, and distribute it with proper attribution.


All kudos go to my professor, who taught me everything I know. I think she will be proud of me for using this many emojis. 😅
To all my friends who supported me on this wonderful journey — I haven't forgotten you, folks. Big thanks and much love to all of you! ❤️
