Download the GoogleNews-vectors-negative300 binary file from this link. All the necessary dependencies can be installed into a virtual environment using the requirements.txt file.
Example: setting up the environment with virtualenv (Linux users):
```bash
virtualenv -p /usr/bin/python3.12 my_env
```
Then activate the environment:
```bash
source my_env/bin/activate
```
Finally, install the packages:
```bash
pip install -r requirements.txt
```
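A quick way to check the download is to load it with gensim. A minimal sketch (the file path is an assumption; point it at wherever the binary was saved):

```python
from gensim.models import KeyedVectors

# Load the pretrained Google News vectors (path is an assumption).
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

print(w2v["good"].shape)                 # (300,): each word maps to a 300-d vector
print(w2v.most_similar("good", topn=3))  # nearest neighbours in the vector space
```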
A comparison of different strategies for sentiment analysis is presented.
The following models are selected for comparison:
- Traditional machine learning (ML) algorithms (Logistic Regression, Random Forest, Stochastic Gradient Descent, and Bernoulli Naive Bayes);
- Deep learning networks (a naive architecture and a CNN);
- Transformers (BERT (base) and XLNet (base)).
The following vectorization/embedding techniques are selected for comparison (a Word2vec training sketch follows the list):
- Vectorization:
  - Count Vectorization
  - TF-IDF
- Embedding:
  - Word2vec:
    - trained on the dataset
    - pretrained on the Google News corpus (300-dimensional vectors)
  - Sentence transformer (used with the deep learning models)
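For the Word2vec variant trained on the dataset, a minimal gensim sketch (the toy corpus and hyperparameters are illustrative assumptions; the pretrained variant is loaded as shown in the setup section above):

```python
from gensim.models import Word2Vec

# Tokenized training corpus (illustrative placeholder sentences).
sentences = [
    ["the", "movie", "was", "great"],
    ["terrible", "plot", "and", "bad", "acting"],
]

# vector_size=300 matches the pretrained Google News vectors;
# the remaining hyperparameters are assumptions.
model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, epochs=10)
vector = model.wv["movie"]  # 300-d vector for a word seen during training
```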
The datasets employed are:
The objective of the preliminary data processing stage was to generate cleaned training, testing, and validation dataframes. All of the datasets were balanced, duplicates were removed, and the sentiment analysis involved three categories: negative, neutral, and positive.
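A minimal sketch of that preprocessing with pandas and scikit-learn (the file name and the `text`/`sentiment` column names are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset.csv")          # file name is an assumption
df = df.drop_duplicates(subset="text")   # remove duplicate texts

# Balance the classes by downsampling each one to the smallest class size.
n = df["sentiment"].value_counts().min()
df = df.groupby("sentiment").sample(n=n, random_state=42)

# Stratified 70/15/15 train/validation/test split (ratios are an assumption).
train_df, rest = train_test_split(
    df, test_size=0.3, stratify=df["sentiment"], random_state=42
)
val_df, test_df = train_test_split(
    rest, test_size=0.5, stratify=rest["sentiment"], random_state=42
)
```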
The accuracy score is used as the evaluation metric. Different embedding/vectorization techniques are compared in combination with traditional ML models. In the present study, the following machine learning algorithms were tested (a pipeline sketch follows the list):
- Logistic Regression
- Random Forest
- Stochastic Gradient Descent (SGD)
- Bernoulli Naive Bayes
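A minimal scikit-learn sketch of one such combination, using the dataframes from the split above (hyperparameters are left at their defaults):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# CountVectorizer + MultinomialNB, the best-performing pairing below.
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(train_df["text"], train_df["sentiment"])

preds = pipe.predict(test_df["text"])
print(accuracy_score(test_df["sentiment"], preds))
```

Swapping in TfidfVectorizer, or SGDClassifier in place of MultinomialNB, is a one-line change.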
The following section presents the results obtained from the different vectorization/embedding implementations with the best-performing algorithms: SGD and MultinomialNB.
| Vectorization/embedding | Accuracy | Best-performing ML algorithm |
|---|---|---|
| Embedding (w2vec trained on the dataset) | 58% | SGD |
| Embedding (w2vec pretrained) | 63.0% | SGD |
| CountVectorizer | 72.4% | MultinomialNB |
| TF-IDF Vectorizer | 72.4% | MultinomialNB |
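For the two Word2vec rows, each document has to be reduced to a single fixed-length vector before it can be fed to SGD. A common approach, mean-pooling the word vectors, is sketched below (whether the study pooled exactly this way is an assumption):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def doc_vector(tokens, w2v, dim=300):
    """Average the vectors of in-vocabulary tokens; zeros if none match."""
    vecs = [w2v[t] for t in tokens if t in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Build document features from whitespace-tokenized texts
# (w2v is the KeyedVectors object loaded earlier).
X_train = np.stack([doc_vector(t.split(), w2v) for t in train_df["text"]])
clf = SGDClassifier(random_state=42).fit(X_train, train_df["sentiment"])
```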
The concluding results of the study are presented below.
| Model | Vectorization/embedding | Accuracy |
|---|---|---|
| MultinomialNB | Vectorization | 72.4% |
| Naive deep learning (15 epochs) | Embedding (sentence transformer) | 70.4% |
| CNN (5 epochs) | Embedding (sentence transformer) | 69.9% |
| BERT | Embedding (token embedding + sentence embedding + positional encoding) | 77.9% |
| XLNet | Embedding (token embedding + sentence embedding + positional encoding) | 80.5% |
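For reference, a minimal sketch of fine-tuning BERT (base) for three-class sentiment with the Hugging Face transformers library (the dataset wiring, epoch count, and batch size are assumptions, not the study's exact configuration):

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3  # negative / neutral / positive
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

# train_ds / val_ds are assumed to be Hugging Face `datasets` objects with
# "text" and "label" columns, e.g. built from the dataframes above.
train_ds = train_ds.map(tokenize, batched=True)
val_ds = val_ds.map(tokenize, batched=True)

args = TrainingArguments(output_dir="bert-sentiment", num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args,
        train_dataset=train_ds, eval_dataset=val_ds).train()
```

Replacing "bert-base-uncased" with "xlnet-base-cased" gives the XLNet (base) variant.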

