This project ingests, processes, and trains machine learning models on text datasets (starting with Hugging Face’s dair-ai/emotion).
The standard schema is:
`text | label | split`
The ingestion pipeline:
- Loads dataset splits (train/validation/test) from Hugging Face.
- Cleans text (removes extra spaces, trims).
- Normalises labels (lowercased, consistent spacing).
- Drops empty rows and duplicates.
- Combines all splits into one DataFrame.
- Saves the processed dataset to CSV or JSON.
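The steps above can be sketched with pandas (a minimal illustration on inline sample rows; the actual `nlc_ingest` cleaning code may differ in detail):

```python
import pandas as pd

def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described above to a raw split DataFrame."""
    out = df.copy()
    # Clean text: collapse internal whitespace and trim the ends.
    out["text"] = out["text"].astype(str).str.split().str.join(" ").str.strip()
    # Normalise labels: lowercased, consistent spacing.
    out["label"] = out["label"].astype(str).str.strip().str.lower()
    # Drop empty rows and duplicates.
    out = out[out["text"] != ""].drop_duplicates(subset=["text", "label"])
    return out

# Tiny stand-in for the Hugging Face splits.
splits = {
    "train": pd.DataFrame({"text": ["i didnt  feel humiliated "], "label": ["Sadness"]}),
    "validation": pd.DataFrame({"text": [""], "label": ["joy"]}),
}

# Combine all splits into one DataFrame and save to CSV.
frames = [clean_frame(df).assign(split=name) for name, df in splits.items()]
combined = pd.concat(frames, ignore_index=True)
combined.to_csv("emotion_clean.csv", index=False)
```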
The training pipeline:
- Loads processed dataset.
- Splits into train/validation sets (either randomly or by the `split` column).
- Builds a pipeline: TF-IDF Vectorizer + Logistic Regression.
- Trains the model.
- Evaluates on validation set (accuracy, precision, recall, F1, confusion matrix).
- Saves the trained model as a `.pkl` file.
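The training pipeline above can be sketched with scikit-learn (tiny inline data for illustration; the actual `train_model.py` hyperparameters and evaluation code may differ):

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline

# Tiny stand-in for the processed dataset.
train_texts = ["i feel so happy today", "i am furious about this",
               "what a joyful surprise", "this makes me so angry"]
train_labels = ["joy", "anger", "joy", "anger"]
val_texts = ["i am really happy", "i am very angry"]
val_labels = ["joy", "anger"]

# TF-IDF Vectorizer + Logistic Regression, as described above.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(train_texts, train_labels)

# Evaluate on the validation set.
preds = model.predict(val_texts)
print("accuracy:", accuracy_score(val_labels, preds))
print(classification_report(val_labels, preds, zero_division=0))
print(confusion_matrix(val_labels, preds))

# Save the trained model as a .pkl file.
joblib.dump(model, "emotion_model.pkl")
```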
```
NLC_EMOTION/
├── data/
│   └── processed/emotion_clean.csv
├── models/
│   └── emotion_model.pkl
├── nlc_ingest/
│   ├── __init__.py
│   ├── cleaning.py
│   ├── config.py
│   └── io.py
├── src/
│   ├── emotion_classifier.py
│   ├── ingest.py
│   ├── simple_interface.py
│   ├── train_model.py
│   └── chatbot_interface.py
├── logs/
│   ├── train.log
│   ├── interface.log
│   └── ingest.log
├── reports/
│   └── figures/confusion_matrix.png
├── tests/
├── venv/
├── Makefile
├── README.md
├── requirements.txt
└── .gitignore
```
1. Clone the repository:

   ```bash
   git clone https://github.com/bushaiba/ml-project-natural-language-classifier.git
   cd ml-project-natural-language-classifier
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate   # Linux/macOS
   venv\Scripts\activate      # Windows
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
Run the ingestion script from the command line:
```bash
python ingest.py --dataset <name of dataset> --format <csv or json> --out <path to output file>
```

Arguments:

- `--dataset` → Hugging Face dataset ID (default: `dair-ai/emotion`)
- `--format` → Output file format (`csv` or `json`, default: `csv`)
- `--out` → Output file path (default: from `config.py`)
Examples:
```bash
python ingest.py
python ingest.py --format json --out data/processed/emotion.json
python ingest.py --dataset imdb --format csv --out data/processed/imdb_clean.csv
```

After ingestion, train the classifier with:
```bash
python train_model.py --input data/processed/emotion_clean.csv --use_split_column --model_out models/emotion_logreg_tfidf.pkl
```

Arguments:

- `--input` → Path to the processed dataset
- `--model_out` → Where to save the trained `.pkl` model
- `--use_split_column` → Use the dataset's `split` column if available
- `--val_size` → Validation fraction for a random split (default: 0.2)
- `--random_state` → Random seed (default: 42)
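The argument handling might look roughly like this (a sketch consistent with the flags documented above; the default paths here are illustrative, and the real `train_model.py` may differ):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Flags mirror the arguments documented above; default paths are assumptions.
    parser = argparse.ArgumentParser(description="Train the emotion classifier.")
    parser.add_argument("--input", default="data/processed/emotion_clean.csv",
                        help="Path to the processed dataset")
    parser.add_argument("--model_out", default="models/emotion_model.pkl",
                        help="Where to save the trained .pkl model")
    parser.add_argument("--use_split_column", action="store_true",
                        help="Use the dataset's split column if available")
    parser.add_argument("--val_size", type=float, default=0.2,
                        help="Validation fraction for a random split")
    parser.add_argument("--random_state", type=int, default=42,
                        help="Random seed")
    return parser

# Parse an example command line.
args = build_parser().parse_args([
    "--input", "data/processed/emotion_clean.csv", "--use_split_column",
])
print(args.use_split_column, args.val_size, args.random_state)
```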
Alternatively, you can use the Makefile shortcuts:

```bash
make ingest      # run ingestion with defaults
make train       # train model and save to models/
make interface   # start CLI interface
```

The simple interface can also be started directly:

```bash
python simple_interface.py
```

Instructions shown in the console:

- Type a sentence → get a label (`joy`, `anger`, `fear`, etc.)
- Type `clear` → clear the screen
- Type `exit` / `quit` / `q` / `x` → exit
The chatbot interface wraps the trained classifier inside a natural language dialogue.
It accepts messy input, extracts the relevant span, classifies it, and responds empathetically.
```bash
python chatbot_interface.py
```

Options:

- `--debug` → prints extracted spans and classifier labels
- `--model_id` → override the default Hugging Face model (default: TinyLlama)
Example session:
```
Assistant: Hello! Say something and I'll classify the emotion.
> My friend said 'I absolutely hated that film', do you think he liked it?
[extracted]: I absolutely hated that film
[label]: anger
Assistant: That sounds negative—hopefully the next one is better.
> I'm excited for my holiday!
[extracted]: I'm excited for my holiday!
[label]: joy
Assistant: That sounds positive—enjoy your trip.
```
- `ingest.log` → ingestion process
- `train.log` → training + evaluation
- `interface.log` → simple interface runtime
- `chatbot.log` → chatbot runtime
Processed files are saved in `data/processed/`.
Schema:
| text | label | split |
|---|---|---|
| "i didnt feel humiliated" | sadness | train |
| "i can go from hopeless to hopeful quickly" | joy | train |
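A processed file can be sanity-checked against this schema with pandas (an inline sample stands in for `data/processed/emotion_clean.csv` so the snippet is self-contained):

```python
import io
import pandas as pd

# Inline sample standing in for data/processed/emotion_clean.csv.
sample = io.StringIO(
    "text,label,split\n"
    "i didnt feel humiliated,sadness,train\n"
    "i can go from hopeless to hopeful quickly,joy,train\n"
)
df = pd.read_csv(sample)

# Every processed file should expose exactly these three columns.
assert list(df.columns) == ["text", "label", "split"]
print(df["label"].value_counts())
```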
After training, you can use the provided emotion_classifier.py to interact with the saved model.
Example usage:
```bash
python emotion_classifier.py
```

This script:

- Loads the trained model from `models/`.
- Cleans and preprocesses input text.
- Provides a key method: `classify(text)` → returns the single best predicted label.
Example output:
Single prediction: joy
If you use the Emotion dataset, cite:
```bibtex
@inproceedings{saravia-etal-2018-carer,
    title = "{CARER}: Contextualized Affect Representations for Emotion Recognition",
    author = "Saravia, Elvis and
      Liu, Hsien-Chi Toby and
      Huang, Yen-Hao and
      Wu, Junlin and
      Chen, Yi-Shin",
    booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D18-1404",
    doi = "10.18653/v1/D18-1404",
    pages = "3687--3697"
}
```