GitHub - pigate/twitter_sentiment_analysis_python: Analyze twitter data, calculate sentiment, etc. Works with live stream or twitter queries

pigate / twitter_sentiment_analysis_python Public

Notifications You must be signed in to change notification settings
Fork 2
Star 5

Analyze twitter data, calculate sentiment, etc. Works with live stream or twitter queries

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
app		app
.gitignore		.gitignore
AFINN-111.txt		AFINN-111.txt
AFINN-README.txt		AFINN-README.txt
README		README
frequency.py		frequency.py
frequency_for_text.py		frequency_for_text.py
get_stream.py		get_stream.py
init.py		init.py
server.js		server.js
term_sentiment.py		term_sentiment.py
top_ten.py		top_ten.py
tw_utils.py		tw_utils.py

Repository files navigation

twitter_sentiment_analysis_python
=================================
A study in classification:

Purpose: Classify tweets as Positive or Negative

DataSet: Tweets from twitter stream

Requirements:
MongoDB running on port 27017
Redis on 6739
API keys from Twitter

Included data sets:
AFINN-111.txt   Dictionary of words with sentiment value. Can be used as a feature set.
output*.txt   Results of query of live twitter stream or simple twitter query. Can use to test if don't want to get your own sample data using twitterstream.py

> sudo mongod
Code files:
[x] init.py
[x] get_stream.py
> Setup: set environment variables (TWITTER_API_KEY, TWITTER_API_SECRET, TWITTER_ACCESS_TOKEN_KEY, TWITTER_ACCESS_TOKEN_SECRET)
*to run: python get_stream.py #creates and writes output to file with name prepended by timestamp



Approach 0:
First pass 
Using the dictionary of word-sentiments
1. Calculate + and - log2 scores
Each tweet starts with + = -1.0, - = -1.0
Each distinct occurance of word in dict constributes log2(abs(score)) to either the + if is + score, or - if is - score
2. If abs(+) > abs(-), tweet is labeled 'positive'

Second pass
For unknown interesting words, calculate a likely +/- score.
1. Calculate likelihood that the word is in a + tweet vs - tweet using Bayes.
2. Store results

Third pass
Recalculate ratings of tweets


Approach 1:
Supervised Learning
1. Provide a training set of tweets that have been properly labeled with positive or negative.
2. Train.
3. Test.

Training/Initializing our classifier:
Choose Features:
1. Each tweet is broken up into tokens (lowercase), and tokens less than length of 3 and containing non English word characters are discarded.
2. Each tweet is mapped to a list of word features, distinct words ordered by frequency.
3. Calculate  P(word | + ), P(word | -)
Discard any words that are difficult to tell if +/-
Criteria: word is too common. assumption: common words such as 'the' and 'and' do not have clear sentiment meaning. 
How to tell if word is something like 'the' and 'and'?
Theory: frequency of word is high in each and P (word | + ) ~ P (word | - )


Use the Naive Bayes classifier:
Use the prior probabilty of each token with frequency of token in training set, and contribution from each feature.
If the word 'amazing' appears in 1 of 5 of positive tweets and none of negative tweets, the likelihood that the tweet is 'positive' is multiplied by .2

Actual classification:
1. Tweet starts out with likelihood 'positive': -1.0, 'negative': -1.0 ( <- log base 2 (1/2))
2. Add contributions from features
(ignore words not in feature list)
note: log base 2 again
eg) 'pos': -6.2, 'neg' : -15.3
3. Classify tweet as 'pos' or 'neg'. If abs('positive') > abs('negative'), classify as 'pos'

Primer on Bayes Probability:
Prior Probability: P(A), P(B)
Conditional Probability: P(x|A), P(x|B)
Posterior Probability: P(A|x), P(B|x)

[x] tweet_sentiment.py
*to run: python tweet_sentiment.py AFINN-111.txt tweet_file.txt result.txt*
[x] term_sentiment.py
*to run: python term_sentiment.py AFINN-111.txt tweet_file result.txt*
[x] frequency.py
*to run: python frequency.py tweet_file result.txt*


Still need to determine which state is the happiest.
Going to start with...

Dictionary of states (only 50 nodes. not too bad): (Sentiment Rating, tweet counts)

If can determine state of the tweeter:
  Determine sentiment of a tweet.
  Update dictionary with tweet sentiment value (compute avg?)

Flip out the dictionary such that
  get new dict with 
    sentiment rating: [state1, state2...]

Our sentiment:state_array dict is in asc. order so,
  we grab the last 10 or so and reverse this bunch
  print out up to 10 of the states. --> happiest 10 states!