Download the GoogleNews-vectors-negative300 binary file from this link. All the necessary dependencies can be installed into a virtual environment using the requirements.txt file.
Example: setting up the environment with virtualenv (Linux users):
```bash
virtualenv -p /usr/bin/python3.12 my_env
```
Then activate the environment:
```bash
source my_env/bin/activate
```
Finally, install the packages:
```bash
pip install -r requirements.txt
```
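A quick way to check the download is to load it with gensim. A minimal sketch (the file path is an assumption; point it at wherever the binary was saved):

```python
from gensim.models import KeyedVectors

# Load the pretrained Google News vectors (path is an assumption).
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

print(w2v["good"].shape)                 # (300,): each word maps to a 300-d vector
print(w2v.most_similar("good", topn=3))  # nearest neighbours in the vector space
```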
A comparison of different strategies for sentiment analysis is presented.
The following models are selected for comparison:
- Traditional machine learning (ML) algorithms (Logistic Regression, Random Forest, Stochastic Gradient Descent, and Bernoulli Naive Bayes);
- Deep learning networks (a naive architecture and a CNN);
- Transformers (BERT (base) and XLNet (base)).
The following vectorization/embedding techniques are selected for comparison (a Word2vec training sketch follows the list):
- Vectorization:
  - Count Vectorization
  - TF-IDF
- Embedding:
  - Word2vec:
    - trained on the dataset
    - pretrained on the Google News corpus (300-dimensional vectors)
  - Sentence transformer (used with the deep learning models)
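For the Word2vec variant trained on the dataset, a minimal gensim sketch (the toy corpus and hyperparameters are illustrative assumptions; the pretrained variant is loaded as shown in the setup section above):

```python
from gensim.models import Word2Vec

# Tokenized training corpus (illustrative placeholder sentences).
sentences = [
    ["the", "movie", "was", "great"],
    ["terrible", "plot", "and", "bad", "acting"],
]

# vector_size=300 matches the pretrained Google News vectors;
# the remaining hyperparameters are assumptions.
model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, epochs=10)
vector = model.wv["movie"]  # 300-d vector for a word seen during training
```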
The datasets employed are:
The objective of the preliminary data processing stage was to generate cleaned training, testing, and validation dataframes. All of the datasets were balanced, duplicates were removed, and the sentiment analysis involved three categories: negative, neutral, and positive.
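A minimal sketch of that preprocessing with pandas and scikit-learn (the file name and the `text`/`sentiment` column names are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset.csv")          # file name is an assumption
df = df.drop_duplicates(subset="text")   # remove duplicate texts

# Balance the classes by downsampling each one to the smallest class size.
n = df["sentiment"].value_counts().min()
df = df.groupby("sentiment").sample(n=n, random_state=42)

# Stratified 70/15/15 train/validation/test split (ratios are an assumption).
train_df, rest = train_test_split(
    df, test_size=0.3, stratify=df["sentiment"], random_state=42
)
val_df, test_df = train_test_split(
    rest, test_size=0.5, stratify=rest["sentiment"], random_state=42
)
```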
The accuracy score is used as the evaluation metric. Different embedding/vectorization techniques are compared in combination with traditional ML models. In the present study, the following machine learning algorithms were tested (a pipeline sketch follows the list):
- Logistic Regression
- Random Forest
- Stochastic Gradient Descent (SGD)
- Bernoulli Naive Bayes
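A minimal scikit-learn sketch of one such combination, using the dataframes from the split above (hyperparameters are left at their defaults):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# CountVectorizer + MultinomialNB, the best-performing pairing below.
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(train_df["text"], train_df["sentiment"])

preds = pipe.predict(test_df["text"])
print(accuracy_score(test_df["sentiment"], preds))
```

Swapping in TfidfVectorizer, or SGDClassifier in place of MultinomialNB, is a one-line change.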
The following section presents the results obtained from the different vectorization/embedding implementations with the best-performing algorithms: SGD and MultinomialNB.
| Vectorization/embedding | Accuracy | Best-performing ML algorithm |
|---|---|---|
| Embedding (w2vec trained on the dataset) | 58% | SGD |
| Embedding (w2vec pretrained) | 63.0% | SGD |
| CountVectorizer | 72.4% | MultinomialNB |
| TF-IDF Vectorizer | 72.4% | MultinomialNB |
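For the two Word2vec rows, each document has to be reduced to a single fixed-length vector before it can be fed to SGD. A common approach, mean-pooling the word vectors, is sketched below (whether the study pooled exactly this way is an assumption):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def doc_vector(tokens, w2v, dim=300):
    """Average the vectors of in-vocabulary tokens; zeros if none match."""
    vecs = [w2v[t] for t in tokens if t in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Build document features from whitespace-tokenized texts
# (w2v is the KeyedVectors object loaded earlier).
X_train = np.stack([doc_vector(t.split(), w2v) for t in train_df["text"]])
clf = SGDClassifier(random_state=42).fit(X_train, train_df["sentiment"])
```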
The concluding results of the study are presented below.
| Model | Vectorization/embedding | Accuracy |
|---|---|---|
| MultinomialNB | Vectorization | 72.4% |
| Naive deep learning (15 epochs) | Embedding (sentence transformer) | 70.4% |
| CNN (5 epochs) | Embedding (sentence transformer) | 69.9% |
| BERT | Embedding (token embedding + sentence embedding + positional encoding) | 77.9% |
| XLNet | Embedding (token embedding + sentence embedding + positional encoding) | 80.5% |
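For reference, a minimal sketch of fine-tuning BERT (base) for three-class sentiment with the Hugging Face transformers library (the dataset wiring, epoch count, and batch size are assumptions, not the study's exact configuration):

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3  # negative / neutral / positive
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

# train_ds / val_ds are assumed to be Hugging Face `datasets` objects with
# "text" and "label" columns, e.g. built from the dataframes above.
train_ds = train_ds.map(tokenize, batched=True)
val_ds = val_ds.map(tokenize, batched=True)

args = TrainingArguments(output_dir="bert-sentiment", num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args,
        train_dataset=train_ds, eval_dataset=val_ds).train()
```

Replacing "bert-base-uncased" with "xlnet-base-cased" gives the XLNet (base) variant.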

