This project ingests, processes, and trains machine learning models on text datasets (starting with Hugging Face’s dair-ai/emotion).
The standard schema is:
`text | label | split`
The ingestion pipeline:
- Loads dataset splits (train/validation/test) from Hugging Face.
- Cleans text (removes extra spaces, trims).
- Normalises labels (lowercased, consistent spacing).
- Drops empty rows and duplicates.
- Combines all splits into one DataFrame.
- Saves the processed dataset to CSV or JSON.
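The steps above can be sketched with pandas (a minimal illustration on inline sample rows; the actual `nlc_ingest` cleaning code may differ in detail):

```python
import pandas as pd

def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described above to a raw split DataFrame."""
    out = df.copy()
    # Clean text: collapse internal whitespace and trim the ends.
    out["text"] = out["text"].astype(str).str.split().str.join(" ").str.strip()
    # Normalise labels: lowercased, consistent spacing.
    out["label"] = out["label"].astype(str).str.strip().str.lower()
    # Drop empty rows and duplicates.
    out = out[out["text"] != ""].drop_duplicates(subset=["text", "label"])
    return out

# Tiny stand-in for the Hugging Face splits.
splits = {
    "train": pd.DataFrame({"text": ["i didnt  feel humiliated "], "label": ["Sadness"]}),
    "validation": pd.DataFrame({"text": [""], "label": ["joy"]}),
}

# Combine all splits into one DataFrame and save to CSV.
frames = [clean_frame(df).assign(split=name) for name, df in splits.items()]
combined = pd.concat(frames, ignore_index=True)
combined.to_csv("emotion_clean.csv", index=False)
```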
The training pipeline:
- Loads processed dataset.
- Splits into train/validation sets (either randomly or by the `split` column).
- Builds a pipeline: TF-IDF Vectorizer + Logistic Regression.
- Trains the model.
- Evaluates on validation set (accuracy, precision, recall, F1, confusion matrix).
- Saves the trained model as a `.pkl` file.
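The training pipeline above can be sketched with scikit-learn (tiny inline data for illustration; the actual `train_model.py` hyperparameters and evaluation code may differ):

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline

# Tiny stand-in for the processed dataset.
train_texts = ["i feel so happy today", "i am furious about this",
               "what a joyful surprise", "this makes me so angry"]
train_labels = ["joy", "anger", "joy", "anger"]
val_texts = ["i am really happy", "i am very angry"]
val_labels = ["joy", "anger"]

# TF-IDF Vectorizer + Logistic Regression, as described above.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(train_texts, train_labels)

# Evaluate on the validation set.
preds = model.predict(val_texts)
print("accuracy:", accuracy_score(val_labels, preds))
print(classification_report(val_labels, preds, zero_division=0))
print(confusion_matrix(val_labels, preds))

# Save the trained model as a .pkl file.
joblib.dump(model, "emotion_model.pkl")
```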
```
NLC_EMOTION/
├── data/
│   └── processed/emotion_clean.csv
├── models/
│   └── emotion_model.pkl
├── nlc_ingest/
│   ├── __init__.py
│   ├── cleaning.py
│   ├── config.py
│   └── io.py
├── src/
│   ├── emotion_classifier.py
│   ├── ingest.py
│   ├── simple_interface.py
│   ├── train_model.py
│   └── chatbot_interface.py
├── logs/
│   ├── train.log
│   ├── interface.log
│   └── ingest.log
├── reports/
│   └── figures/confusion_matrix.png
├── tests/
├── venv/
├── Makefile
├── README.md
├── requirements.txt
└── .gitignore
```
1. Clone the repository:

   ```bash
   git clone https://github.com/bushaiba/ml-project-natural-language-classifier.git
   cd ml-project-natural-language-classifier
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate   # Linux/macOS
   venv\Scripts\activate      # Windows
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
Run the ingestion script from the command line:
```bash
python ingest.py --dataset <name of dataset> --format <csv or json> --out <path to output file>
```

Arguments:

- `--dataset` → Hugging Face dataset ID (default: `dair-ai/emotion`)
- `--format` → Output file format (`csv` or `json`, default: `csv`)
- `--out` → Output file path (default: from `config.py`)
Examples:
```bash
python ingest.py
python ingest.py --format json --out data/processed/emotion.json
python ingest.py --dataset imdb --format csv --out data/processed/imdb_clean.csv
```

After ingestion, train the classifier with:
```bash
python train_model.py --input data/processed/emotion_clean.csv --use_split_column --model_out models/emotion_logreg_tfidf.pkl
```

Arguments:

- `--input` → Path to the processed dataset
- `--model_out` → Where to save the trained `.pkl` model
- `--use_split_column` → Use the dataset's `split` column if available
- `--val_size` → Validation fraction for a random split (default: 0.2)
- `--random_state` → Random seed (default: 42)
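The argument handling might look roughly like this (a sketch consistent with the flags documented above; the default paths here are illustrative, and the real `train_model.py` may differ):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Flags mirror the arguments documented above; default paths are assumptions.
    parser = argparse.ArgumentParser(description="Train the emotion classifier.")
    parser.add_argument("--input", default="data/processed/emotion_clean.csv",
                        help="Path to the processed dataset")
    parser.add_argument("--model_out", default="models/emotion_model.pkl",
                        help="Where to save the trained .pkl model")
    parser.add_argument("--use_split_column", action="store_true",
                        help="Use the dataset's split column if available")
    parser.add_argument("--val_size", type=float, default=0.2,
                        help="Validation fraction for a random split")
    parser.add_argument("--random_state", type=int, default=42,
                        help="Random seed")
    return parser

# Parse an example command line.
args = build_parser().parse_args([
    "--input", "data/processed/emotion_clean.csv", "--use_split_column",
])
print(args.use_split_column, args.val_size, args.random_state)
```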
Alternatively, you can use the Makefile shortcuts:

```bash
make ingest      # run ingestion with defaults
make train       # train model and save to models/
make interface   # start CLI interface
```

The simple interface can also be started directly:

```bash
python simple_interface.py
```

Instructions shown in the console:

- Type a sentence → get a label (`joy`, `anger`, `fear`, etc.)
- Type `clear` → clear the screen
- Type `exit` / `quit` / `q` / `x` → exit
The chatbot interface wraps the trained classifier inside a natural language dialogue.
It accepts messy input, extracts the relevant span, classifies it, and responds empathetically.
```bash
python chatbot_interface.py
```

Options:

- `--debug` → prints extracted spans and classifier labels
- `--model_id` → override the default Hugging Face model (default: TinyLlama)
Example session:
```
Assistant: Hello! Say something and I'll classify the emotion.
> My friend said 'I absolutely hated that film', do you think he liked it?
[extracted]: I absolutely hated that film
[label]: anger
Assistant: That sounds negative—hopefully the next one is better.
> I'm excited for my holiday!
[extracted]: I'm excited for my holiday!
[label]: joy
Assistant: That sounds positive—enjoy your trip.
```
- `ingest.log` → ingestion process
- `train.log` → training + evaluation
- `interface.log` → simple interface runtime
- `chatbot.log` → chatbot runtime
Processed files are saved in `data/processed/`.
Schema:
| text | label | split |
|---|---|---|
| "i didnt feel humiliated" | sadness | train |
| "i can go from hopeless to hopeful quickly" | joy | train |
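A processed file can be sanity-checked against this schema with pandas (an inline sample stands in for `data/processed/emotion_clean.csv` so the snippet is self-contained):

```python
import io
import pandas as pd

# Inline sample standing in for data/processed/emotion_clean.csv.
sample = io.StringIO(
    "text,label,split\n"
    "i didnt feel humiliated,sadness,train\n"
    "i can go from hopeless to hopeful quickly,joy,train\n"
)
df = pd.read_csv(sample)

# Every processed file should expose exactly these three columns.
assert list(df.columns) == ["text", "label", "split"]
print(df["label"].value_counts())
```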
After training, you can use the provided emotion_classifier.py to interact with the saved model.
Example usage:
```bash
python emotion_classifier.py
```

This script:

- Loads the trained model from `models/`.
- Cleans and preprocesses input text.
- Provides a key method: `classify(text)` → returns the single best predicted label.
Example output:
Single prediction: joy
If you use the Emotion dataset, cite:
```bibtex
@inproceedings{saravia-etal-2018-carer,
    title = "{CARER}: Contextualized Affect Representations for Emotion Recognition",
    author = "Saravia, Elvis and
      Liu, Hsien-Chi Toby and
      Huang, Yen-Hao and
      Wu, Junlin and
      Chen, Yi-Shin",
    booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D18-1404",
    doi = "10.18653/v1/D18-1404",
    pages = "3687--3697"
}
```