|
142 | 142 | "cell_type": "markdown", |
143 | 143 | "metadata": {}, |
144 | 144 | "source": [ |
145 | | - "### EXERCISE\n", |
| 145 | + "### EXERCISE 1\n", |
146 | 146 | "\n", |
147 | 147 | "Modify the code above to plot how often \"Coronavirus\" is used in each of the three subreddits over time" |
148 | 148 | ] |
|
193 | 193 | "source": [ |
194 | 194 | "male_words = ['he', 'his']\n", |
195 | 195 | "female_words = ['she', 'hers']\n", |
| 196 | + "\n", |
196 | 197 | "# This puts all of the text of each subreddit into lists\n", |
197 | | - "grouped_text = sr.groupby('subreddit').all_text.apply(lambda x: ' '.join(x).split())\n", |
| 198 | + "def string_to_list(x):\n", |
| 199 | + " return ' '.join(x).split()\n", |
| 200 | + "grouped_text = sr.groupby('subreddit').all_text.apply(string_to_list)\n", |
| 201 | + "\n", |
198 | 202 | "# Then, we count how often each type of words appears in each subreddit\n", |
199 | 203 | "agg = grouped_text.aggregate({'proportionMale': lambda x: sum([x.count(y) for y in male_words])/len(x),\n", |
200 | 204 | " 'proportionFemale': lambda x: sum([x.count(y) for y in female_words])/len(x)}\n", |
|
214 | 218 | "cell_type": "markdown", |
215 | 219 | "metadata": {}, |
216 | 220 | "source": [ |
217 | | - "### EXERCISES\n", |
| 221 | + "### EXERCISE 2\n", |
218 | 222 | "\n", |
219 | | - "1. One of the trickiest parts of analysis is getting the data in the form that you want it in order to analyze/visualize it. \n", |
| 223 | + "One of the trickiest parts of analysis is getting the data in the form that you want it in order to analyze/visualize it. \n", |
220 | 224 | "\n", |
221 | 225 | "I think a good visualization for this would be a barplot showing how often male and female word types appear for each subreddit. I'll give you the final call to produce the plot:\n", |
222 | 226 | "\n", |
|
240 | 244 | "### Your code here\n", |
241 | 245 | "\n", |
242 | 246 | "\n", |
243 | | - "sns.barplot(x='subreddit', y='proportion', hue = 'word_gender', data = agg_df)" |
| 247 | + "#sns.barplot(x='subreddit', y='proportion', hue = 'word_gender', data = agg_df)" |
244 | 248 | ] |
245 | 249 | }, |
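For Exercise 2, the wrangling step is getting `agg` into the long format that `sns.barplot` expects. Below is a minimal sketch of one possible reshape, not the notebook's own solution. It assumes `agg` is the wide DataFrame produced by the `aggregate` call above (indexed by subreddit, with `proportionMale` and `proportionFemale` columns); the `melt` column names are chosen to match the commented-out plot call.

```python
import seaborn as sns

# A sketch of one possible reshape, assuming `agg` is the wide DataFrame
# produced by the aggregate call above: indexed by subreddit, with
# 'proportionMale' and 'proportionFemale' columns.
agg_df = agg.reset_index().melt(
    id_vars='subreddit',
    value_vars=['proportionMale', 'proportionFemale'],
    var_name='word_gender',    # matches the hue= argument in the plot call
    value_name='proportion')   # matches the y= argument

sns.barplot(x='subreddit', y='proportion', hue='word_gender', data=agg_df)
```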
246 | 250 | { |
247 | 251 | "cell_type": "markdown", |
248 | 252 | "metadata": {}, |
249 | 253 | "source": [ |
250 | | - "2. Make your own analysis, with a different set of terms" |
| 254 | + "### EXERCISE 3\n", |
| 255 | + "\n", |
| 256 | + "Make your own analysis, with a different set of terms" |
251 | 257 | ] |
252 | 258 | }, |
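For Exercise 3, the same counting pattern works for any pair of word lists. A sketch of one possibility, where the term sets are illustrative placeholders and not from the notebook:

```python
# A sketch reusing the counting pattern from above with different terms.
# These word lists are illustrative placeholders, not from the notebook.
happy_words = ['happy', 'glad', 'great']
sad_words = ['sad', 'angry', 'terrible']

alt_agg = grouped_text.aggregate(
    {'proportionHappy': lambda x: sum([x.count(y) for y in happy_words]) / len(x),
     'proportionSad': lambda x: sum([x.count(y) for y in sad_words]) / len(x)})
```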
253 | 259 | { |
|
265 | 271 | "There are a number of NLP / text analysis libraries in Python. The one I'm most familiar with is scikit-learn, which is a machine learning library. NLTK, SpaCy, and textblob are some of the most popular. Here is how to run TF-IDF in scikit-learn." |
266 | 272 | ] |
267 | 273 | }, |
| 274 | + { |
| 275 | + "cell_type": "code", |
| 276 | + "execution_count": null, |
| 277 | + "metadata": {}, |
| 278 | + "outputs": [], |
| 279 | + "source": [ |
| 280 | + "## First, we prepare the data for the TF-IDF tool.\n", |
| 281 | + "# We want each subreddit to be represented by a list of strings.\n", |
| 282 | + "# So, we take our grouped_text (which is a list of lists of words)\n", |
| 283 | + "# and change it into a list of three really long strings, where each\n", |
| 284 | + "# string is all the words that appeared for that subreddit.\n", |
| 285 | + "\n", |
| 286 | + "# This called a 'list comprehension'\n", |
| 287 | + "as_text = [' '.join(x) for x in grouped_text]\n", |
| 288 | + "\n", |
| 289 | + "# It is equivalent to the following for loop\n", |
| 290 | + "as_text = []\n", |
| 291 | + "for x in grouped_text:\n", |
| 292 | + " as_text.append(''.join(x))" |
| 293 | + ] |
| 294 | + }, |
268 | 295 | { |
269 | 296 | "cell_type": "code", |
270 | 297 | "execution_count": null, |
|
276 | 303 | "# Just gets the 5000 most common words\n", |
277 | 304 | "vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')\n", |
278 | 305 | "\n", |
279 | | - "as_text = [' '.join(x) for x in grouped_text]\n", |
280 | | - "\n", |
281 | 306 | "tfidf_result = vectorizer.fit_transform(as_text)\n", |
282 | 307 | "feature_names = vectorizer.get_feature_names()\n", |
283 | 308 | "dense = tfidf_result.todense()\n", |
|
525 | 550 | "cell_type": "markdown", |
526 | 551 | "metadata": {}, |
527 | 552 | "source": [ |
528 | | - "### EXERCISE\n", |
| 553 | + "### EXERCISE 4\n", |
529 | 554 | "\n", |
530 | 555 | "Where topic modeling really shines is in analyzing longer texts - for example, the subreddit [changemyview](https://www.reddit.com/r/changemyview/) has fairly long posts where people explain a controversial view that they hold.\n", |
531 | 556 | "\n", |
|