A project showcasing a hybrid approach to log classification that combines rule-based, machine learning, and LLM-based classifiers for robust and accurate log categorization.
This project implements a three-layer classification pipeline designed for high-volume and diverse log environments:
- Regex Classifier - Fast, deterministic rule-based classification
- ML Classifier - Probabilistic classification using machine learning models
- LLM Classifier - Advanced fallback using Large Language Models for complex cases
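Each layer returns a result only when its confidence meets that layer's threshold; otherwise the log is passed to the next layer. Most logs are therefore handled by the fast layers, while ambiguous ones fall through to the slower, more accurate ones.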
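As a minimal sketch of what the first layer could look like, the snippet below matches a log line against a small list of patterns. The rule set, the `Result` type, and the fixed confidence of 1.0 are illustrative assumptions, not the project's actual rules or scores.

```python
import re
from typing import NamedTuple, Optional


class Result(NamedTuple):
    label: str
    confidence: float
    layer: str


# Hypothetical rules as (compiled pattern, label) pairs.
# These are assumptions for illustration, not the project's real rule set.
RULES = [
    (re.compile(r"\bERROR\b"), "ERROR"),
    (re.compile(r"\bWARN(ING)?\b"), "WARNING"),
]


def regex_classify(log: str) -> Optional[Result]:
    """Return a Result for the first matching rule, or None if nothing matches."""
    for pattern, label in RULES:
        if pattern.search(log):
            # A rule match is deterministic, so this sketch assigns full confidence.
            return Result(label, 1.0, "regex")
    return None
```

Returning `None` when no rule matches gives the caller a clear signal to hand the log to the next layer.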
├── app/ # Main application
├── models/ # Trained model artifacts
├── data/ # Data
├── scripts/ # Scripts to train ML model
├── requirements.txt # Python dependencies
├── run_api.sh # API startup script
└── README.md # This file
- Python 3.8+
- pip
- Clone the repository:
git clone https://github.com/sejalshitole/Log-Classification.git
cd Log-Classification
- Create a virtual environment:
python -m venv venv
source venv/bin/activate
- Install dependencies:
pip install -r requirements.txt
- Configure environment variables:
cp .env.example .env
Edit .env and set your configuration:
GEMINI_API_KEY=your_api_key_here
Start the FastAPI server:
bash run_api.sh
or manually:
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
The API will be available at http://127.0.0.1:8000/docs
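Once the server is running, the endpoints can be called from any HTTP client. The sketch below builds a request to the single-log endpoint `/v1/logs/analyze` using only the Python standard library. The helper names `build_analyze_request` and `analyze_log` are hypothetical and not part of this project; the base URL assumes the default host and port from the startup command.

```python
import json
import urllib.request

# Assumes the server was started locally on port 8000.
BASE_URL = "http://127.0.0.1:8000"


def build_analyze_request(log: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying one log line as JSON."""
    payload = json.dumps({"log": log}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/logs/analyze",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze_log(log: str) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_analyze_request(log)) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect without a running server.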
Classify a single log:
curl -X POST "http://127.0.0.1:8000/v1/logs/analyze" \
-H "Content-Type: application/json" \
-d '{"log": "ERROR [2024-01-01 10:00:00] Database connection failed"}'
Response:
{
  "label": "ERROR",
  "confidence": 0.98,
  "layer": "regex",
  "llm_explanation": null
}
Batch process CSV file:
curl -X POST "http://127.0.0.1:8000/v1/logs/analyze/batch" \
-F "file=@logs.csv" \
-F "log_column=message"
Response format:
{
  "label": "Type of Log",
  "confidence": 0.0-1.0,
  "layer": "regex|ml|llm",
  "llm_explanation": "Optional explanation from LLM"
}
Train or retrain the machine learning classifier:
python -m scripts.train_ml
The hybrid classifier uses a cascading approach:
Input Log
↓
[1] Regex Classifier (fast, deterministic)
├─ Confidence ≥ REGEX_CONFIDENCE? → Return
└─ Confidence < REGEX_CONFIDENCE? → Next layer
↓
[2] ML Classifier (probabilistic)
├─ Confidence ≥ ML_CONFIDENCE? → Return
└─ Confidence < ML_CONFIDENCE? → Next layer
↓
[3] LLM Classifier (slow, most accurate)
└─ Return with explanation
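
The cascade above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's implementation: the layer callables, their signatures, and the default threshold values (0.9 and 0.7) are hypothetical. Only the threshold names and the response fields come from this README.

```python
from typing import Callable, Optional, Tuple

Scored = Tuple[str, float]          # (label, confidence)
Explained = Tuple[str, float, str]  # (label, confidence, explanation)


def _result(label: str, confidence: float, layer: str,
            explanation: Optional[str] = None) -> dict:
    """Shape a result like the API response documented above."""
    return {"label": label, "confidence": confidence,
            "layer": layer, "llm_explanation": explanation}


def classify(log: str,
             regex_layer: Callable[[str], Scored],
             ml_layer: Callable[[str], Scored],
             llm_layer: Callable[[str], Explained],
             regex_confidence: float = 0.9,   # assumed value for REGEX_CONFIDENCE
             ml_confidence: float = 0.7       # assumed value for ML_CONFIDENCE
             ) -> dict:
    """Try each layer in order and return at the first confident answer."""
    label, conf = regex_layer(log)
    if conf >= regex_confidence:
        return _result(label, conf, "regex")

    label, conf = ml_layer(log)
    if conf >= ml_confidence:
        return _result(label, conf, "ml")

    # The last layer always answers, and includes an explanation.
    label, conf, explanation = llm_layer(log)
    return _result(label, conf, "llm", explanation)


# Stub layers standing in for the real classifiers (purely illustrative).
def regex_stub(log: str) -> Scored:
    return ("ERROR", 0.98) if "ERROR" in log else ("UNKNOWN", 0.0)


def ml_stub(log: str) -> Scored:
    return ("WARNING", 0.8) if "slow" in log else ("UNKNOWN", 0.3)


def llm_stub(log: str) -> Explained:
    return ("INFO", 0.6, "stub explanation")
```

With these stubs, `classify("ERROR db down", regex_stub, ml_stub, llm_stub)` is answered by the regex layer, while a line that neither stub recognizes falls through to the LLM stub and carries its explanation.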