
Sentiment_Analysis_comparing_strategies

Recommendation

Download the GoogleNews-vectors-negative300 binary file from this link. All the necessary dependencies can be installed into a virtual environment using the requirements.txt file.

Example: setting up the environment with virtualenv (Linux users).

virtualenv -p /usr/bin/python3.12 my_env

Then activate the environment

source my_env/bin/activate

Finally, install the packages

pip install -r requirements.txt

Aim of this work

A comparison of different strategies for sentiment analysis is presented.

The following models are selected for comparison:

  • Traditional machine learning (ML) algorithms (Logistic Regression, Random Forest, Stochastic Gradient Descent and Bernoulli Naive Bayes);
  • Deep learning networks (a naive architecture and a CNN; see the sketch after this list);
  • Transformers (BERT (base) and XLNet (base)).
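
As an illustration of the deep-learning branch, below is a minimal sketch of a 1D-convolutional text classifier in Keras; the vocabulary size, sequence length and layer widths are placeholders and are not taken from this repository.

```python
# Minimal sketch of a 1D-CNN text classifier for three sentiment classes
# (negative / neutral / positive). Vocabulary size and layer widths are
# illustrative placeholders, not the settings used in this work.
from tensorflow.keras import layers, models

VOCAB_SIZE = 20_000   # assumed vocabulary size after tokenisation
EMBED_DIM = 128       # assumed embedding dimension

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),   # learned word embeddings
    layers.Conv1D(128, 5, activation="relu"),  # n-gram-style convolution
    layers.GlobalMaxPooling1D(),               # keep the strongest feature per filter
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax"),     # one probability per sentiment class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train_padded, y_train, validation_data=(X_val_padded, y_val), epochs=5)
```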

The following vectorisation/embedding techniques are selected for comparison (a minimal sketch of all four options follows the list):

  • Vectorisation:
    • Count Vectorisation
    • TF-IDF
  • Embedding:
    • Word2vec:
      • trained on the dataset
      • pretrained on GoogleNews-vectors-negative300 (the binary file mentioned above)
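
A minimal sketch of the four feature-extraction options, assuming scikit-learn and gensim are available and that `train_texts` (a list of raw strings) and `tokenised_texts` (a list of token lists) already exist:

```python
# Minimal sketch of the four feature-extraction options compared in this work.
# `train_texts` and `tokenised_texts` are assumed to exist; file names are placeholders.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec, KeyedVectors

# 1-2. Count and TF-IDF vectorisation
X_counts = CountVectorizer().fit_transform(train_texts)
X_tfidf = TfidfVectorizer().fit_transform(train_texts)

# 3. Word2vec trained on the dataset itself
w2v_local = Word2Vec(sentences=tokenised_texts, vector_size=300, window=5, min_count=2)

# 4. Pretrained GoogleNews vectors (the binary file mentioned above)
w2v_google = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)
```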

Dataset

The datasets employed are:

The objective of the preliminary data-processing stage was to generate cleaned training, testing and validation dataframes. All of the datasets were balanced, duplicates were removed, and the sentiment analysis involved three categories: negative, neutral and positive.
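
The following is only a minimal pandas sketch of the preprocessing goals described above; the column names and file name are assumptions, not taken from this repository.

```python
# Minimal sketch of the preprocessing goals: drop duplicates, balance the three
# classes, and split into train/validation/test dataframes.
# Column names ("text", "sentiment") and the file name are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset.csv")          # placeholder file name
df = df.drop_duplicates(subset="text")   # remove duplicate samples

# Downsample every class to the size of the smallest one so the data are balanced
min_count = df["sentiment"].value_counts().min()
df = (df.groupby("sentiment", group_keys=False)
        .apply(lambda g: g.sample(min_count, random_state=42)))

# 80/10/10 split, stratified so the class balance is preserved in every dataframe
train_df, rest_df = train_test_split(df, test_size=0.2,
                                     stratify=df["sentiment"], random_state=42)
val_df, test_df = train_test_split(rest_df, test_size=0.5,
                                   stratify=rest_df["sentiment"], random_state=42)
```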


Sentiment Analysis


Evaluation metric and results

The accuracy score is used as the evaluation metric. Different embedding/vectorisation techniques are compared in combination with the traditional ML models. The following machine learning algorithms were tested (a sketch of the comparison loop follows the list):

  • Logistic Regression;
  • Random Forest;
  • Stochastic Gradient Descent (SGD);
  • Bernoulli Naive Bayes.
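
A minimal sketch of such a comparison loop with scikit-learn, using TF-IDF features and accuracy as the metric; `train_df` and `test_df` are assumed to be the cleaned dataframes from the preprocessing step.

```python
# Minimal sketch of the traditional-ML comparison: each classifier is trained on
# the same TF-IDF features and scored with accuracy. Variable and column names
# are assumptions, not taken from this repository.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_df["text"])
X_test = vectorizer.transform(test_df["text"])

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(),
    "SGD": SGDClassifier(),
    "Bernoulli Naive Bayes": BernoulliNB(),
}
for name, clf in models.items():
    clf.fit(X_train, train_df["sentiment"])
    acc = accuracy_score(test_df["sentiment"], clf.predict(X_test))
    print(f"{name}: {acc:.3f}")
```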

The following table presents the results obtained with the different vectorisation/embedding implementations for the best-performing algorithms: SGD and MultinomialNB.

| Vectorisation/embedding | Accuracy (optimum) | ML algorithm |
| --- | --- | --- |
| Embedding (w2vec, trained on the dataset) | 58% | SGD |
| Embedding (w2vec, pre-trained) | 63.0% | SGD |
| CountVectorizer | 72.4% | MultinomialNB |
| TF-IDF Vectorizer | 72.4% | MultinomialNB |

The concluding results of the study are presented below (a transformer fine-tuning sketch follows the table).

| Model | Vectorisation/embedding | Accuracy |
| --- | --- | --- |
| MNB | Vectorisation | 72.4% |
| Naive deep learning (15 epochs) | Embedding (sentence transformer) | 70.4% |
| CNN (5 epochs) | Embedding (sentence transformer) | 69.9% |
| BERT | Embedding (token embedding + sentence embedding + positional encoding) | 77.9% |
| XLNet | Embedding (token embedding + sentence embedding + positional encoding) | 80.5% |
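
For reference, a minimal sketch of how the BERT-base and XLNet-base classifiers could be set up for 3-class sentiment classification with the Hugging Face transformers library; the checkpoint names and forward-pass example are illustrative, and the fine-tuning settings used in this work are not reproduced here.

```python
# Minimal sketch: load BERT-base and XLNet-base with a 3-label classification head.
# Checkpoint names and the example sentences are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

for checkpoint in ("bert-base-uncased", "xlnet-base-cased"):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)
    # Tokenise a batch and run a forward pass (fine-tuning would use a Trainer or a
    # standard PyTorch training loop on the cleaned dataframes).
    batch = tokenizer(["great movie", "it was fine", "terrible plot"],
                      padding=True, truncation=True, return_tensors="pt")
    logits = model(**batch).logits   # one row of 3 class logits per sentence
```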
