-
Notifications
You must be signed in to change notification settings - Fork 2
pigate/twitter_sentiment_analysis_python
Folders and files
| Name | Name | Last commit message | Last commit date | |
|---|---|---|---|---|
Repository files navigation
twitter_sentiment_analysis_python
=================================
A study in classification:
Purpose: Classify tweets as Positive or Negative
DataSet: Tweets from twitter stream
Requirements:
MongoDB running on port 27017
Redis on 6739
API keys from Twitter
Included data sets:
AFINN-111.txt Dictionary of words with sentiment value. Can be used as a feature set.
output*.txt Results of query of live twitter stream or simple twitter query. Can use to test if don't want to get your own sample data using twitterstream.py
> sudo mongod
Code files:
[x] init.py
[x] get_stream.py
> Setup: set environment variables (TWITTER_API_KEY, TWITTER_API_SECRET, TWITTER_ACCESS_TOKEN_KEY, TWITTER_ACCESS_TOKEN_SECRET)
*to run: python get_stream.py #creates and writes output to file with name prepended by timestamp
Approach 0:
First pass
Using the dictionary of word-sentiments
1. Calculate + and - log2 scores
Each tweet starts with + = -1.0, - = -1.0
Each distinct occurance of word in dict constributes log2(abs(score)) to either the + if is + score, or - if is - score
2. If abs(+) > abs(-), tweet is labeled 'positive'
Second pass
For unknown interesting words, calculate a likely +/- score.
1. Calculate likelihood that the word is in a + tweet vs - tweet using Bayes.
2. Store results
Third pass
Recalculate ratings of tweets
Approach 1:
Supervised Learning
1. Provide a training set of tweets that have been properly labeled with positive or negative.
2. Train.
3. Test.
Training/Initializing our classifier:
Choose Features:
1. Each tweet is broken up into tokens (lowercase), and tokens less than length of 3 and containing non English word characters are discarded.
2. Each tweet is mapped to a list of word features, distinct words ordered by frequency.
3. Calculate P(word | + ), P(word | -)
Discard any words that are difficult to tell if +/-
Criteria: word is too common. assumption: common words such as 'the' and 'and' do not have clear sentiment meaning.
How to tell if word is something like 'the' and 'and'?
Theory: frequency of word is high in each and P (word | + ) ~ P (word | - )
Use the Naive Bayes classifier:
Use the prior probabilty of each token with frequency of token in training set, and contribution from each feature.
If the word 'amazing' appears in 1 of 5 of positive tweets and none of negative tweets, the likelihood that the tweet is 'positive' is multiplied by .2
Actual classification:
1. Tweet starts out with likelihood 'positive': -1.0, 'negative': -1.0 ( <- log base 2 (1/2))
2. Add contributions from features
(ignore words not in feature list)
note: log base 2 again
eg) 'pos': -6.2, 'neg' : -15.3
3. Classify tweet as 'pos' or 'neg'. If abs('positive') > abs('negative'), classify as 'pos'
Primer on Bayes Probability:
Prior Probability: P(A), P(B)
Conditional Probability: P(x|A), P(x|B)
Posterior Probability: P(A|x), P(B|x)
[x] tweet_sentiment.py
*to run: python tweet_sentiment.py AFINN-111.txt tweet_file.txt result.txt*
[x] term_sentiment.py
*to run: python term_sentiment.py AFINN-111.txt tweet_file result.txt*
[x] frequency.py
*to run: python frequency.py tweet_file result.txt*
Still need to determine which state is the happiest.
Going to start with...
Dictionary of states (only 50 nodes. not too bad): (Sentiment Rating, tweet counts)
If can determine state of the tweeter:
Determine sentiment of a tweet.
Update dictionary with tweet sentiment value (compute avg?)
Flip out the dictionary such that
get new dict with
sentiment rating: [state1, state2...]
Our sentiment:state_array dict is in asc. order so,
we grab the last 10 or so and reverse this bunch
print out up to 10 of the states. --> happiest 10 states!
About
Analyze twitter data, calculate sentiment, etc. Works with live stream or twitter queries
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published