Detect malicious Unicode characters in source code
Unicleaner is a security-focused CLI tool that scans source code repositories to detect potentially malicious Unicode characters that could hide backdoors or exploits, including:
- Zero-width characters (U+200B, U+200C, U+200D, U+FEFF)
- Bidirectional override characters (U+202A-U+202E) - Trojan Source attacks
- Homoglyphs - visually similar characters from different scripts
- Non-printable control characters outside standard ASCII range
Features:
- 🔒 Deny-by-default security - only explicitly allowed characters pass
- ⚙️ Configurable - TOML-based configuration with language presets
- 🚀 Fast - parallel scanning with Rayon
- 🎨 Colored output - human-readable terminal output with automatic TTY detection
- 📊 JSON output - machine-parseable format for CI/CD integration
- 🔄 Git integration - scan only changed files in pull requests
- 🌍 Multilingual support - 50+ language presets for legitimate Unicode
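The deny-by-default idea from the feature list can be pictured with a small Python sketch. This is only an illustration of the policy, not unicleaner's implementation: the base ranges, the `GERMAN_PRESET` ranges, and the function names are all assumptions made for this example, and the real presets live in the TOML configuration.

```python
# Illustrative deny-by-default check. Ranges and names below are assumptions
# for this sketch, not unicleaner's actual presets or API.

# Printable ASCII plus tab, line feed, and carriage return always pass.
BASE_ALLOWED = [(0x20, 0x7E), (0x09, 0x0A), (0x0D, 0x0D)]

# Hypothetical "preset": extra code points a project explicitly opts into
# (here the German umlauts and sharp s).
GERMAN_PRESET = [
    (0x00C4, 0x00C4), (0x00D6, 0x00D6), (0x00DC, 0x00DC), (0x00DF, 0x00DF),
    (0x00E4, 0x00E4), (0x00F6, 0x00F6), (0x00FC, 0x00FC),
]


def is_allowed(ch, ranges):
    """Return True if the character's code point falls in any allowed range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def denied_characters(text, extra_ranges=()):
    """Return (line, column, code point) for every character not explicitly allowed."""
    ranges = BASE_ALLOWED + list(extra_ranges)
    hits = []
    # Split on "\n" only, so unusual line separators are still inspected.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if not is_allowed(ch, ranges):
                hits.append((line_no, col, f"U+{ord(ch):04X}"))
    return hits
```

With no preset, `denied_characters("gr\u00f6\u00dfe")` reports the two non-ASCII letters; passing `GERMAN_PRESET` as `extra_ranges` lets them through, while a zero-width space is still rejected because nothing allows it.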
cargo install unicleaner
nix run github:poelzi/unicleaner
# Pull from GitHub Container Registry
docker pull ghcr.io/poelzi/unicleaner:latest
# Scan current directory
docker run --rm -v "$(pwd):/workspace" ghcr.io/poelzi/unicleaner:latest .
See Docker Usage Guide for detailed instructions and CI/CD integration examples.
Use the published action in your repository workflow:
name: Unicode Security Check
on:
  pull_request:
    branches: [main]
jobs:
  unicode-security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: poelzi/unicleaner@v1
        with:
          mode: diff
          base-ref: main
          fail-on-violations: true
You can also call the reusable workflow directly:
jobs:
  unicode-security:
    uses: poelzi/unicleaner/.github/workflows/unicode-check.yml@v1
    with:
      mode: diff
      base-ref: main
      fail-on-violations: true
git clone https://github.com/poelzi/unicleaner
cd unicleaner
cargo build --release
./target/release/unicleaner --version
# Scan current directory
unicleaner scan .
# Generate default config
unicleaner init
# Scan with custom config
unicleaner scan . --config unicleaner.toml
# Scan only Git changes (for CI/CD)
unicleaner scan . --diff
# Output JSON for machine parsing
unicleaner scan . --format json
# Filter by severity level
unicleaner scan . --severity error
# Control color output
unicleaner scan . --color always
unicleaner scan . --color never
unicleaner scan . --no-color # deprecated but supported
# Quiet mode (summary only)
unicleaner scan . --quiet
# Verbose mode (show progress)
unicleaner scan . --verbose
# List available language presets
unicleaner list-presets
When scanning code that contains bidirectional override characters (as in CVE-2021-42574):
$ unicleaner scan tests/fixtures/trojan_source.rs
🔍 Scanning: tests/fixtures/trojan_source.rs
❌ VIOLATION: tests/fixtures/trojan_source.rs:12:45
Character: U+202E (RIGHT-TO-LEFT OVERRIDE)
Category: Bidi Control
Severity: ERROR
Pattern: Bidirectional Override
Description: Character can reorder text visually, potentially hiding malicious code
Context:
10 | fn is_admin(user: &str) -> bool {
11 | let access_level = check_user(user);
12 | if access_level == "admin"/* }if access_level != "user { // */
^
13 | return true;
14 | }
───────────────────────────────────────────────────────────────────────────────
Scan Result: FAILED
Files scanned: 1
Files clean: 0
Files with violations: 1
Total violations: 1
Severity breakdown:
ERROR: 1
WARNING: 0
INFO: 0
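To see why a report like the one above matters during review, it can help to make the override characters visible. The following Python sketch is an illustration only (the function names are hypothetical and unrelated to unicleaner's internals); it covers just the U+202A to U+202E range named in this README.

```python
import unicodedata

# Bidirectional embedding/override controls U+202A..U+202E, as listed in this README.
BIDI_CONTROLS = {chr(cp) for cp in range(0x202A, 0x202F)}


def bidi_columns(line):
    """Return the 1-based columns at which a bidi control appears."""
    return [col for col, ch in enumerate(line, start=1) if ch in BIDI_CONTROLS]


def reveal_bidi(line):
    """Replace each bidi control with a visible <U+XXXX NAME> tag."""
    out = []
    for ch in line:
        if ch in BIDI_CONTROLS:
            out.append(f"<U+{ord(ch):04X} {unicodedata.name(ch)}>")
        else:
            out.append(ch)
    return "".join(out)
```

Printing `reveal_bidi(line)` for a flagged line shows the logical character order, which is what the compiler sees, rather than the reordered text an editor may display.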
Scanning for invisible characters that could hide backdoors:
$ unicleaner scan tests/fixtures/zero_width.py --verbose
🔍 Scanning directory: tests/fixtures/
[1/3] tests/fixtures/zero_width.py
❌ VIOLATION: tests/fixtures/zero_width.py:5:23
Character: U+200B (ZERO WIDTH SPACE)
Category: Zero Width
Severity: WARNING
Pattern: Zero-Width Character
Description: Invisible character that serves no legitimate purpose in code
Context:
3 | def authenticate(username, password):
4 | # Check credentials
5 | if username == "admin": # Zero-width space after admin
^
6 | return check_admin_access(password)
7 | return False
[2/3] tests/fixtures/clean_file.rs ✓
[3/3] tests/fixtures/clean_file.py ✓
───────────────────────────────────────────────────────────────────────────────
Scan Result: FAILED
Files scanned: 3
Files clean: 2
Files with violations: 1
Total violations: 1
Severity breakdown:
ERROR: 0
WARNING: 1
INFO: 0
Duration: 12ms
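The risk behind the zero-width finding above is that two strings can look identical while comparing unequal. A minimal Python sketch of that effect, plus a simple finder for the four code points this README lists, might look like this (the names are hypothetical, not part of unicleaner):

```python
# The four zero-width code points named in this README.
ZERO_WIDTH = {
    "\u200b": "ZERO WIDTH SPACE",
    "\u200c": "ZERO WIDTH NON-JOINER",
    "\u200d": "ZERO WIDTH JOINER",
    "\ufeff": "ZERO WIDTH NO-BREAK SPACE",
}


def find_zero_width(text):
    """Return (line, column, code point, name) for each zero-width character."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in ZERO_WIDTH:
                hits.append((line_no, col, f"U+{ord(ch):04X}", ZERO_WIDTH[ch]))
    return hits


# These render the same in most editors but are different strings.
visible = "admin"
hidden = "admin\u200b"
strings_match = visible == hidden  # False: the comparison silently fails
```

A credential check written against `hidden` would never match the value a user actually types, which is the kind of invisible logic change the scanner is meant to surface.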
When everything is safe:
$ unicleaner scan src/ --quiet
Scan Result: PASSED ✓
Files scanned: 42
Files clean: 42
Files with violations: 0
Duration: 156ms
Machine-readable output for automation:
$ unicleaner scan suspicious.rs --format json
{
  "violations": [
    {
      "file_path": "suspicious.rs",
      "line": 12,
      "column": 45,
      "code_point": 8238,
      "character": "\u202e",
      "category": "BidiControl",
      "severity": "Error",
      "pattern_name": "Bidirectional Override",
      "description": "Character can reorder text visually, potentially hiding malicious code",
      "context": {
        "before": "if access_level == \"admin\"/*",
        "match": "\u202e",
        "after": " }if access_level != \"user { // */"
      }
    }
  ],
  "files_scanned": 1,
  "files_clean": 0,
  "files_with_violations": 1,
  "errors": [],
  "duration_ms": 8,
  "config_used": "unicleaner.toml"
}
Unicleaner includes a test corpus with intentionally malicious Unicode to verify detection:
$ unicleaner scan tests/fixtures/
🔍 Scanning: tests/fixtures/
❌ Found 12 violations in test corpus (expected for testing)
Test files intentionally contain malicious Unicode patterns:
✓ Trojan Source attacks (CVE-2021-42574)
✓ Zero-width characters
✓ Homoglyph attacks
✓ Non-printable control characters
✓ Mixed script confusables
This verifies that detection is working correctly!
───────────────────────────────────────────────────────────────────────────────
Scan Result: FAILED (as expected for test corpus)
Files scanned: 8
Files with violations: 8
Total violations: 12
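Homoglyph and mixed-script cases from the corpus above are harder to spot than invisible characters. One rough way to think about them is to flag identifiers that mix scripts. The Python sketch below uses the first word of each character's Unicode name as a stand-in for its script; that is a simplifying assumption for illustration, not how unicleaner classifies characters, and it will misjudge some names (for example fullwidth forms).

```python
import unicodedata


def script_of(ch):
    """Rough script guess: first word of the Unicode name (heuristic only)."""
    if not ch.isalpha():
        return None  # digits, underscores, and punctuation carry no script here
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def mixed_script_identifiers(identifiers):
    """Map each identifier that mixes scripts to the sorted scripts it contains."""
    flagged = {}
    for ident in identifiers:
        scripts = {s for s in map(script_of, ident) if s}
        if len(scripts) > 1:
            flagged[ident] = sorted(scripts)
    return flagged
```

For example, `"p\u0430ypal"` contains CYRILLIC SMALL LETTER A in place of the Latin letter, so it is reported with both `CYRILLIC` and `LATIN`, while plain `"paypal"` is not flagged.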
Only scan changed files in a pull request:
$ unicleaner scan . --diff
🔍 Git diff mode: scanning only changed files
Changed files in current branch:
M src/auth.rs
M src/utils.rs
A tests/test_new_feature.rs
[1/3] src/auth.rs ✓
[2/3] src/utils.rs ✓
[3/3] tests/test_new_feature.rs ✓
───────────────────────────────────────────────────────────────────────────────
Scan Result: PASSED ✓
Files scanned: 3
Files clean: 3
Files with violations: 0
All changed files are safe to merge!
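For readers who want to reproduce a changed-file list like the one above outside of unicleaner, a minimal Python sketch follows. It assumes the list comes from `git diff --name-status <base>...HEAD`; how unicleaner itself collects changed files is not documented here, and the function names are hypothetical.

```python
import subprocess


def parse_name_status(output):
    """Return paths that still exist after the change (added, modified, renamed, copied)."""
    paths = []
    for line in output.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        status = fields[0]
        if status.startswith("D"):
            continue  # deleted files have nothing left to scan
        # Renames and copies list the old path, then the new path.
        paths.append(fields[-1])
    return paths


def changed_files(base_ref="main"):
    """Ask Git for files changed on the current branch relative to base_ref."""
    result = subprocess.run(
        ["git", "diff", "--name-status", f"{base_ref}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return parse_name_status(result.stdout)
```

Note that Git quotes paths containing unusual characters in this output unless `-z` is used, so a production version would need to handle that case.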
Commands:
- scan [PATH] - Scan files for malicious Unicode (default command)
- init [FILE] - Generate a default configuration file
- list-presets - Show available language presets
Options:
- -c, --config <FILE> - Path to configuration file
- -f, --format <FORMAT> - Output format: human, json, github, gitlab (default: human)
- --color <WHEN> - Color output: auto, always, never (default: auto)
- --no-color - Disable color output (deprecated, use --color=never)
- -q, --quiet - Show only summary (suppress individual violations)
- -v, --verbose - Show verbose output with progress messages
- --severity <LEVEL> - Minimum severity to report: error, warning, info
- --diff - Scan only files changed in Git (requires Git repository)
- -j, --jobs <N> - Number of parallel threads (default: number of CPUs)
- --encoding <ENC> - Force specific encoding: utf8, utf16-le, utf16-be, utf32-le, utf32-be
Exit codes:
- 0 - Success: No violations found
- 1 - Violations found
- 2 - Error: Invalid arguments, file read errors, etc.
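When post-processing a saved JSON report, the severity threshold and exit codes listed above can be mirrored in a few lines of Python. This sketch relies only on the field names visible in the sample JSON output (`violations`, `severity`, `errors`); the helper names, and the choice to treat an unknown severity as an error, are assumptions of this example.

```python
import json

# Ordering used for the minimum-severity filter (an assumption of this sketch).
SEVERITY_RANK = {"info": 0, "warning": 1, "error": 2}


def filter_violations(report, minimum="warning"):
    """Keep violations at or above the given minimum severity."""
    threshold = SEVERITY_RANK[minimum.lower()]
    return [
        v for v in report.get("violations", [])
        # Unknown severities are treated as errors so the gate fails closed.
        if SEVERITY_RANK.get(v["severity"].lower(), 2) >= threshold
    ]


def exit_code(report, minimum="error"):
    """Mirror the documented codes: 0 clean, 1 violations, 2 scan errors."""
    if report.get("errors"):
        return 2
    return 1 if filter_violations(report, minimum) else 0


def load_report(path):
    """Read a report previously written with --format json."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

A CI step could call `sys.exit(exit_code(load_report("report.json"), "warning"))` to apply a stricter threshold to an existing report without rescanning.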
- Quickstart Guide
- Configuration Examples
- CI/CD Integration
- GitHub Action Guide
- Docker Usage Guide
- Nix Build System
Prevent malicious Unicode from entering your repository:
#!/bin/bash
# .git/hooks/pre-commit
# Scan only files changed in Git; the hook is skipped if unicleaner is not installed.
if command -v unicleaner &> /dev/null; then
    unicleaner scan --diff --severity error
    exit $?
fi
Or use it with the pre-commit framework:
# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: unicleaner
        name: Unicode Security Scanner
        entry: unicleaner scan --diff --severity error
        language: system
        pass_filenames: false
Scan pull requests automatically:
# .github/workflows/unicode-security.yml
name: Unicode Security Check
on:
  pull_request:
    branches: [main]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: poelzi/unicleaner@v1
        with:
          mode: diff
          base-ref: main
          fail-on-violations: true
# .gitlab-ci.yml
unicode-security-scan:
  stage: test
  image: ghcr.io/poelzi/unicleaner:latest
  script:
    # Use the gitlab output format, since the report is uploaded as a Code Quality artifact
    - unicleaner scan . --format gitlab > gl-code-quality-report.json
  artifacts:
    reports:
      codequality: gl-code-quality-report.json
Scan third-party dependencies before integration:
# Scan a downloaded library
unicleaner scan vendor/suspicious-library/ --severity error
# Scan before npm/cargo/pip install
unicleaner scan package-to-audit/ && npm install
Generate reports for code review platforms:
# GitHub format (for PR comments)
unicleaner scan . --format github > review-comments.json
# GitLab format
unicleaner scan . --format gitlab > gitlab-report.json
VS Code task configuration:
// .vscode/tasks.json
{
  "version": "2.0.0",
  "tasks": [
    {
      "label": "Unicode Security Scan",
      "type": "shell",
      "command": "unicleaner scan ${file} --color always",
      "problemMatcher": [],
      "group": {
        "kind": "test",
        "isDefault": false
      }
    }
  ]
}
Scan specific packages or services:
# Scan all services
for service in services/*; do
    echo "Scanning $service..."
    unicleaner scan "$service" --quiet || exit 1
done
# Scan only changed packages in monorepo
CHANGED_PACKAGES=$(git diff --name-only main... | cut -d/ -f1-2 | sort -u)
for pkg in $CHANGED_PACKAGES; do
    unicleaner scan "$pkg"
done
Daily scans with notification:
#!/bin/bash
# daily-scan.sh
REPORT_FILE="scan-$(date +%Y%m%d).json"
unicleaner scan . --format json > "$REPORT_FILE"
VIOLATIONS=$(jq '.violations | length' "$REPORT_FILE")
if [ "$VIOLATIONS" -gt 0 ]; then
    # Send alert (Slack, email, etc.)
    curl -X POST "$SLACK_WEBHOOK" \
        -H 'Content-Type: application/json' \
        -d "{\"text\": \"⚠️ Found $VIOLATIONS Unicode violations in codebase!\"}"
fi
Ensure clean releases:
# Before tagging a release
unicleaner scan . --severity error
if [ $? -eq 0 ]; then
    git tag v1.0.0
    git push origin v1.0.0
else
    echo "❌ Cannot release: Unicode violations found!"
    exit 1
fi
Generate compliance reports:
# Scan and generate audit report
unicleaner scan . \
    --format json \
    > "compliance-report-$(date +%Y%m%d).json"
# Convert to a plain-text summary for compliance documentation
jq -r '.violations[] |
    "File: \(.file_path)\n" +
    "Line: \(.line)\n" +
    "Issue: \(.pattern_name)\n" +
    "Severity: \(.severity)\n\n"' \
    compliance-report-*.json > audit.txt
nix develop
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
cargo build
cargo test
cargo clippy
cargo fmt
Or use just recipes:
just build
just test
just check
just fmt-check
just build-static
just build-docker
just coverage
just fuzz fuzz-parallel-scanner 30
cargo run -- .
Unicleaner has comprehensive test coverage, including:
Unit tests:
cargo test --lib
Integration tests:
cargo test --test integration
Property-based tests (with proptest):
cargo test --test proptest
# Run with more cases
PROPTEST_CASES=10000 cargo test --test proptest
Fuzz testing (requires nightly Rust):
cargo +nightly fuzz run fuzz_unicode -- -max_total_time=60
cargo +nightly fuzz run fuzz_config -- -max_total_time=60
cargo +nightly fuzz run encoding_detection -- -max_total_time=60
Performance benchmarks:
cargo bench
Code coverage:
cargo tarpaulin --out Html
See Testing Documentation for more details.
Dual-licensed under MIT OR Apache-2.0
This tool helps detect Unicode-based security vulnerabilities. For security issues in the tool itself, please report responsibly via GitHub Security Advisories.