This project implements an enhanced anti‑virus / malware detection system suitable for a cyber security course. Rather than focusing on machine learning internals, the tool demonstrates how static analysis, heuristics, signature matching and a pre‑trained model can be combined to detect malicious files. Improvements over the basic version include:
- File type classification: The scanner attempts to identify the high‑level file type (PE executable, ELF, text file, archive or generic binary) from magic bytes.
- Network indicator detection: Hardcoded IP addresses and URLs are counted, as they are often associated with command‑and‑control or download functionality.
- Signature‑based detection: A small database of known malware hashes triggers immediate high‑risk classification when matched.
- Expanded heuristics: Rules consider dangerous file extensions, entropy thresholds, suspicious strings and network indicators to provide transparent reasoning for each verdict.
- Multi‑file scanning & improved UI: The Streamlit interface allows uploading multiple samples at once, shows a progress bar during scanning, displays a summary table and exposes detailed information (features and detection methods) for each file. Recent scans are logged and can be reviewed in the app.
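The file type classification bullet above can be illustrated with a minimal sketch. The magic-byte table, the function name `classify_file_type` and the text-detection fallback are assumptions made for illustration; the project's own classifier may use different signatures and labels.

```python
import codecs

# Hypothetical magic-byte table (assumption, not taken from detector.py).
MAGIC_SIGNATURES = [
    (b"MZ", "PE executable"),      # DOS/PE header
    (b"\x7fELF", "ELF"),           # ELF header
    (b"PK\x03\x04", "archive"),    # ZIP local file header
    (b"\x1f\x8b", "archive"),      # gzip
]


def classify_file_type(data: bytes) -> str:
    """Return a coarse file type label based on leading magic bytes."""
    for magic, label in MAGIC_SIGNATURES:
        if data.startswith(magic):
            return label
    # Fallback: treat the file as text if the first 4 KiB contain no NUL
    # bytes and decode as UTF-8. The incremental decoder tolerates a
    # multi-byte character cut off at the end of the sample.
    sample = data[:4096]
    if b"\x00" not in sample:
        try:
            codecs.getincrementaldecoder("utf-8")().decode(sample)
            return "text file"
        except UnicodeDecodeError:
            pass
    return "generic binary"
```

Magic bytes are only a hint: a text file that happens to start with "MZ" would be labelled as a PE executable by this sketch, so the result is best treated as one feature among several.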
The machine learning layer remains an optional enhancement rather than the core of the system. All code is written in Python and exposed through an interactive Streamlit interface.
Malware is a broad term for any software designed to gain unauthorised access or cause damage. It encompasses viruses, worms, trojans, ransomware and spyware【37765113171284†L116-L125】. A computer virus is a specific type of malware that attaches itself to other executable files and replicates when they run, often modifying or deleting data in the process【37765113171284†L104-L107】【37765113171284†L141-L147】. In other words, all viruses are malware, but not all malware is a virus; our system therefore performs generic malware detection.
The system focuses on static malware analysis. Static analysis examines a program without executing it【686329836607284†L62-L70】, making it quick and safe. Analysts inspect metadata, strings, structure and code to identify a malware’s functionality【686329836607284†L74-L81】. Traditional anti‑virus tools use signature‑based detection: they compute a digital fingerprint of the file and compare it against a database of known malware signatures【686329836607284†L84-L90】. While effective for known threats, signature‑based methods struggle with new or modified malware and can miss code that activates only under certain conditions【686329836607284†L96-L101】.
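As a rough illustration of signature-based detection, the sketch below fingerprints a file with SHA-256 and looks the digest up in a set. The database contents and the helper names (`KNOWN_MALWARE_HASHES`, `sha256_of`, `matches_signature`) are hypothetical; the single entry is derived from a placeholder byte string rather than from any real malware.

```python
import hashlib

# Hypothetical signature database: SHA-256 hex digests of known-bad files.
# The entry is computed from a harmless placeholder string (assumption).
KNOWN_MALWARE_HASHES = {
    hashlib.sha256(b"placeholder malicious payload").hexdigest(),
}


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 fingerprint of the file content as hex."""
    return hashlib.sha256(data).hexdigest()


def matches_signature(data: bytes, database: set) -> bool:
    """True if the file's fingerprint appears in the signature database."""
    return sha256_of(data) in database
```

Changing a single byte of the sample produces a different digest, so `matches_signature(b"placeholder malicious payload!", KNOWN_MALWARE_HASHES)` is false. This is the weakness described above: exact-hash signatures miss modified variants.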
In contrast, dynamic malware analysis executes the suspicious program in a controlled sandbox and observes its behaviour【686329836607284†L114-L131】. This approach is more comprehensive because it reveals the malware’s runtime logic, communication patterns and evasion mechanisms【686329836607284†L135-L142】. Dynamic analysis is behaviour‑based; it looks for actions rather than static signatures【686329836607284†L144-L152】. The project does not execute unknown code for safety reasons. Instead, it uses static proxies for likely behaviour (entropy and suspicious strings) and a pre‑trained model to augment static inspection.
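The two static features mentioned here, entropy and suspicious strings, can be computed without running the sample. The following is a minimal sketch; the string list and function names are assumptions chosen for illustration and are not taken from the project's code.

```python
import math
from collections import Counter

# Hypothetical list of API names often seen in injection code (assumption).
SUSPICIOUS_STRINGS = [b"VirtualAlloc", b"CreateRemoteThread", b"WriteProcessMemory"]


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 (constant) up to 8.0 (uniform)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def count_suspicious_strings(data: bytes) -> int:
    """Total number of occurrences of any listed substring in the file."""
    return sum(data.count(s) for s in SUSPICIOUS_STRINGS)
```

Packed or encrypted content tends to approach 8 bits per byte, while plain text is usually much lower, which is why a high-entropy threshold is a common obfuscation heuristic.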
Modern malware detection employs several complementary techniques:
- Signature‑based detection: Antivirus software scans files and compares their unique digital footprints against a database of known malware signatures【461303845287205†L499-L514】. When a match is found, the software can quarantine or delete the file. Signature‑based detection is effective as a first line of defence but cannot detect new or polymorphic threats【461303845287205†L514-L515】 and relies on frequent updates.
- Heuristics and static rules: To address unseen malware, security tools use heuristics such as CRC checksums, statistical analysis or specific byte patterns【461303845287205†L519-L545】. Application allowlisting restricts execution to approved software【461303845287205†L552-L566】. Our system implements simple heuristics: high entropy, suspicious API names and file type indicators.
- Behaviour‑based (anomaly) detection: Dynamic techniques driven by machine learning distinguish malicious from benign behaviour by analysing file actions, network traffic and execution patterns【461303845287205†L575-L586】. Behavioural models can detect unknown malware but may produce false positives【125292643958462†L155-L177】. The project includes a lightweight logistic regression model trained on synthetic features (file size, entropy and suspicious string count) as an enhancement layer.
- Combining methods: Mature security solutions fuse signature‑based, heuristic and anomaly‑based detections to reduce blind spots【125292643958462†L149-L151】. Our decision engine combines the heuristic verdict with the model’s prediction and selects the highest risk level.
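The "select the highest risk level" rule can be sketched as follows. The level names, probability cut-offs and logistic weights are all made up for illustration; the project's trained model and thresholds are not shown in this document, so treat every constant here as an assumption.

```python
import math

# Hypothetical ordered risk levels, least to most severe (assumption).
LEVELS = ["low", "medium", "high"]


def model_probability(file_size: int, entropy: float, suspicious_count: int) -> float:
    """Logistic function over the three features, with made-up weights."""
    z = -6.0 + 0.6 * entropy + 0.8 * suspicious_count + 1e-7 * file_size
    return 1.0 / (1.0 + math.exp(-z))


def level_from_probability(p: float) -> str:
    """Map a malicious-probability to a risk level (illustrative cut-offs)."""
    if p >= 0.8:
        return "high"
    if p >= 0.5:
        return "medium"
    return "low"


def combine(heuristic_level: str, probability=None, signature_match: bool = False) -> str:
    """Return the most severe level among signature, heuristics and model."""
    candidates = [heuristic_level]
    if signature_match:
        candidates.append("high")
    if probability is not None:  # the model is optional
        candidates.append(level_from_probability(probability))
    return max(candidates, key=LEVELS.index)
```

Taking the maximum favours catching threats over avoiding alerts: a single confident detector is enough to raise the verdict, which lowers false negatives at the cost of more false positives.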
A detection system can make mistakes. A false positive occurs when a benign file is flagged as malicious, whereas a false negative means a malicious file is missed【898829068623872†L62-L69】. Both errors are problematic: too many false positives create alert fatigue for security teams and may cause real threats to be overlooked【898829068623872†L71-L80】, while false negatives leave systems exposed to attacks【898829068623872†L82-L86】. By combining heuristic and model‑based approaches, the system aims to reduce false negatives without producing excessive false positives. Logged results allow users to audit detections and tune heuristics.
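To make the two error types concrete, the sketch below computes false positive and false negative rates from labelled scan results. The function names and the tiny data set in the usage note are invented for illustration and are not measurements from this project.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN where 1 = malicious and 0 = benign."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1  # benign file flagged as malicious
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1  # malicious file missed
    return tp, fp, tn, fn


def error_rates(y_true, y_pred):
    """Return (false_positive_rate, false_negative_rate)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr
```

For example, with three malicious and five benign files where one of each is misclassified, `error_rates([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])` gives a false positive rate of 1/5 and a false negative rate of 1/3.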
The project follows a simple layered architecture:
1. Input layer: Users upload a file through the Streamlit interface. Supported types include executables, text files and arbitrary binaries.
2. Analysis layer: The backend (detector.py) performs static analysis: it computes the SHA‑256 hash, file size and Shannon entropy and counts occurrences of suspicious substrings. These features are inexpensive to compute and do not require executing the sample.
3. Decision engine: Heuristic rules evaluate the features; for example, multiple suspicious API names, hardcoded IP addresses or URLs can indicate malicious functionality, and very high entropy can indicate packing or obfuscation. Known malware hashes are checked via a small signature database. If a pre‑trained model is available, the system also predicts a probability of maliciousness from the extracted features. The final verdict is the most severe outcome among the signature match, heuristics and model.
4. Alerting & logging: Scan results (timestamp, filename, hash, verdict, threat level, methods and features) are appended to a CSV log. Additional fields include the detected file type, counts of IP/URL patterns and whether a signature match occurred. The Streamlit app displays recent logs so users can review past scans. The detection module runs entirely offline; no data is sent externally.
5. UI layer: A Streamlit web application (app.py) provides a multi‑file upload widget, a scan button and an enhanced results view. Users can upload several files at once, observe a progress bar during scanning and view detailed features (file type, entropy, suspicious string count, IP/URL counts) for each sample. Recent scan history is accessible via a separate table.
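The network-indicator counts and heuristic rules used by the analysis and decision layers might look like the sketch below. The regular expressions, extension list, thresholds and scoring rule are assumptions for illustration and do not reproduce detector.py.

```python
import re

# Illustrative patterns (assumptions): dotted-quad IPv4 and http(s) URLs.
OCTET = rb"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IPV4_RE = re.compile(rb"\b(?:" + OCTET + rb"\.){3}" + OCTET + rb"\b")
URL_RE = re.compile(rb"https?://[^\s'\"<>]+")

# Hypothetical rule parameters.
DANGEROUS_EXTENSIONS = {".exe", ".scr", ".bat", ".vbs", ".ps1"}
ENTROPY_THRESHOLD = 7.2


def count_network_indicators(data: bytes):
    """Return (ip_count, url_count) for hardcoded indicators in the file."""
    return len(IPV4_RE.findall(data)), len(URL_RE.findall(data))


def apply_heuristics(extension, entropy, suspicious_count, ip_count, url_count):
    """Return (level, reasons) so every verdict carries its justification."""
    reasons = []
    if extension.lower() in DANGEROUS_EXTENSIONS:
        reasons.append(f"dangerous extension {extension}")
    if entropy >= ENTROPY_THRESHOLD:
        reasons.append(f"high entropy ({entropy:.2f})")
    if suspicious_count >= 2:
        reasons.append(f"{suspicious_count} suspicious strings")
    if ip_count + url_count > 0:
        reasons.append(f"{ip_count} IPs and {url_count} URLs embedded")
    # Simple scoring: three or more triggered rules is high risk.
    if len(reasons) >= 3:
        level = "high"
    elif reasons:
        level = "medium"
    else:
        level = "low"
    return level, reasons
```

Returning the list of triggered rules alongside the level is what keeps the verdict explainable. Note that the IPv4 pattern also matches four-part version strings such as 1.2.3.4, so this count is a weak signal on its own.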
- Install requirements: pip install -r requirements.txt
- Launch the application: streamlit run app.py
- Upload a sample file (e.g., from the test_samples folder) and click Scan File.
- View the verdict, threat level, extracted features and detection methods. The Scan Log shows recent scans.
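The Scan Log is backed by a CSV file to which each result is appended. A minimal sketch of that pattern is shown here; the column names, function names and file path are assumptions, and the real log layout may differ.

```python
import csv
import os
from datetime import datetime, timezone

# Hypothetical column layout based on the fields described in this document.
LOG_FIELDS = [
    "timestamp", "filename", "sha256", "verdict", "threat_level",
    "methods", "file_type", "ip_count", "url_count", "signature_match",
]


def append_scan(log_path: str, record: dict) -> None:
    """Append one scan result, writing the header if the log is new."""
    is_new = not os.path.exists(log_path) or os.path.getsize(log_path) == 0
    with open(log_path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=LOG_FIELDS, extrasaction="ignore")
        if is_new:
            writer.writeheader()
        row = {"timestamp": datetime.now(timezone.utc).isoformat(), **record}
        writer.writerow(row)  # missing fields are written as empty strings


def recent_scans(log_path: str, limit: int = 10) -> list:
    """Return the last `limit` rows of the log as dictionaries."""
    if not os.path.exists(log_path):
        return []
    with open(log_path, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    return rows[-limit:]
```

An append-only CSV keeps the log human-readable and easy to audit, and a table widget in the app could display the rows returned by `recent_scans` directly.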
The test_samples/ directory contains three safe files for demonstration:
| File | Description |
|---|---|
| benign_sample.txt | A harmless text file with no suspicious content. |
| suspicious_sample.exe | A fake executable containing suspicious API names but no real malware. It should trigger heuristic rules. |
| unknown_sample.bin | Random binary data to test entropy‑based heuristics. |
These samples are safe and should not trigger Windows Defender; they are provided solely to test the detection pipeline.
This project demonstrates how cyber security principles can be applied to build a functional malware detection system without focusing on the mathematics of machine learning. By combining static analysis, simple heuristics and a lightweight model, the system illustrates key concepts such as signature‑based vs. behaviour‑based detection, the trade‑off between false positives and negatives, and the importance of logging and transparency. Students can extend the heuristics, integrate real signature databases or replace the model with more advanced behavioural analysis as part of their coursework.