This project implements a sentiment analysis model that classifies tweets by sentiment using a Random Forest classifier. The model leverages Natural Language Processing (NLP) techniques to preprocess text data and extract meaningful features for classification.
To set up the project, clone the repository and install the required packages:

```bash
git clone https://github.com/akshargrover/tweet-sentiment-analysis
cd tweet-sentiment-analysis
pip install -r requirements.txt
```

To run the sentiment analysis model, open the notebook in Jupyter and run all cells:

```bash
jupyter notebook twitter-sentiment-analysis.ipynb
```

This will train the model on the training dataset and evaluate it on the test dataset.
The dataset used for training and testing consists of tweets labeled by sentiment. The data is split into training and test sets to evaluate the model's performance.
- Training Data: Contains tweets used to train the model, located in the `twitter_training.csv` file.
- Test Data: Contains tweets used to evaluate the model's accuracy, located in the `twitter_validation.csv` file.
- The dataset includes a variety of tweets collected from Twitter, ensuring a diverse representation of sentiments.
- Each tweet is labeled as positive, negative, neutral, or irrelevant, allowing the model to learn from all four classes.
- The data is preprocessed to remove noise and irrelevant information, enhancing the model's performance.
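A quick way to inspect the data is to load it with pandas. The snippet below is a minimal sketch; the column names (`id`, `entity`, `sentiment`, `text`) and the header-less CSV layout are assumptions about the file format, illustrated here with inline sample rows rather than the real file:

```python
from io import StringIO

import pandas as pd

# Hypothetical sample rows mirroring the assumed layout of
# twitter_training.csv: id, entity/topic, sentiment label, tweet text,
# with no header row.
sample = StringIO(
    "101,AcmeGame,Positive,I love this game\n"
    "102,AcmeGame,Negative,This update ruined everything\n"
)
columns = ["id", "entity", "sentiment", "text"]  # assumed column names
df = pd.read_csv(sample, names=columns)

# Class distribution is worth checking before training, since the model
# uses class_weight='balanced' to compensate for imbalance.
print(df["sentiment"].value_counts())
```

To load the real file, replace `sample` with the path `twitter_training.csv`.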
The model is built using the following steps:
- Data Preprocessing:
  - Text normalization (lowercasing, removing punctuation, etc.)
  - Tokenization and lemmatization using spaCy
  - Vectorization using TF-IDF to convert text into numerical features
- Model Selection:
  - A Random Forest classifier is used for its robustness and ability to handle high-dimensional data.
- Hyperparameters:
  - `n_estimators`: 200 (number of trees in the forest)
  - `max_depth`: 15 (maximum depth of the trees)
  - `class_weight`: 'balanced' (to handle class imbalance)
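The steps above can be sketched as a scikit-learn pipeline. This is an illustrative sketch, not the project's exact code: it uses a tiny hard-coded corpus in place of the real tweets, and the spaCy lemmatization step is omitted for brevity (the TF-IDF vectorizer's default lowercasing tokenizer stands in for it). The Random Forest hyperparameters match those listed above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the real tweet data (illustrative only).
texts = [
    "i love this game",
    "great update so much fun",
    "terrible update hate it",
    "worst patch ever",
    "love the new map",
    "hate the constant lag",
]
labels = ["Positive", "Positive", "Negative", "Negative", "Positive", "Negative"]

# TF-IDF features feeding a Random Forest with the hyperparameters
# listed in this README; class_weight='balanced' reweights classes
# inversely to their frequency.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True)),
    ("rf", RandomForestClassifier(
        n_estimators=200,
        max_depth=15,
        class_weight="balanced",
        random_state=42,
    )),
])
clf.fit(texts, labels)
print(clf.predict(["love the update"]))
```

A `Pipeline` keeps vectorization and classification in one object, so the same TF-IDF vocabulary fitted on the training tweets is reused at prediction time.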
The model's performance is evaluated using accuracy, precision, recall, and F1-score metrics. The results are printed in the console after running the model.
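The metrics named above can be printed with scikit-learn's built-in helpers. The labels below are hypothetical stand-ins for real model output, used only to show the shape of the report:

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical true and predicted labels (illustrative only).
y_true = ["Positive", "Negative", "Negative", "Positive"]
y_pred = ["Positive", "Negative", "Positive", "Positive"]

# classification_report prints per-class precision, recall, and F1-score.
print("accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
```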
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
This project is licensed under the MIT License - see the LICENSE file for details.