Commit 60a0731

Small changes

1 parent e21d269 commit 60a0731

1 file changed: +34 −9 lines changed

day_10/day_10.ipynb

Lines changed: 34 additions & 9 deletions
@@ -142,7 +142,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### EXERCISE\n",
+"### EXERCISE 1\n",
 "\n",
 "Modify the code above to plot how often \"Coronavirus\" is used in each of the three subreddits over time"
 ]
@@ -193,8 +193,12 @@
 "source": [
 "male_words = ['he', 'his']\n",
 "female_words = ['she', 'hers']\n",
+"\n",
 "# This puts all of the text of each subreddit into lists\n",
-"grouped_text = sr.groupby('subreddit').all_text.apply(lambda x: ' '.join(x).split())\n",
+"def string_to_list(x):\n",
+"    return ' '.join(x).split()\n",
+"grouped_text = sr.groupby('subreddit').all_text.apply(string_to_list)\n",
+"\n",
 "# Then, we count how often each type of words appears in each subreddit\n",
 "agg = grouped_text.aggregate({'proportionMale': lambda x: sum([x.count(y) for y in male_words])/len(x),\n",
 "                              'proportionFemale': lambda x: sum([x.count(y) for y in female_words])/len(x)}\n",
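The rewritten cell above groups all text by subreddit and computes pronoun proportions. As a self-contained sketch of the same idea (the `sr` DataFrame below is a toy stand-in; the column names `subreddit` and `all_text` come from the diff, and the dict-of-lambdas `aggregate` call is replaced with plain elementwise `apply`, since pandas' fallback for that pattern is deprecated):

```python
import pandas as pd

male_words = ['he', 'his']
female_words = ['she', 'hers']

# Toy stand-in for the notebook's `sr` DataFrame.
sr = pd.DataFrame({
    'subreddit': ['news', 'news', 'science'],
    'all_text': ['he said his piece', 'she wrote hers', 'he and she spoke'],
})

def string_to_list(x):
    # Join all posts in the group into one string, then split into words.
    return ' '.join(x).split()

grouped_text = sr.groupby('subreddit').all_text.apply(string_to_list)

# Per-subreddit proportion of male/female pronouns, computed elementwise.
agg = pd.DataFrame({
    'proportionMale': grouped_text.apply(
        lambda words: sum(words.count(w) for w in male_words) / len(words)),
    'proportionFemale': grouped_text.apply(
        lambda words: sum(words.count(w) for w in female_words) / len(words)),
})
print(agg)
```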
@@ -214,9 +218,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### EXERCISES\n",
+"### EXERCISE 2\n",
 "\n",
-"1. One of the trickiest parts of analysis is getting the data in the form that you want it in order to analyze/visualize it. \n",
+"One of the trickiest parts of analysis is getting the data in the form that you want it in order to analyze/visualize it. \n",
 "\n",
 "I think a good visualization for this would be a barplot showing how often male and female word types appear for each subreddit. I'll give you the final call to produce the plot:\n",
 "\n",
@@ -240,14 +244,16 @@
 "### Your code here\n",
 "\n",
 "\n",
-"sns.barplot(x='subreddit', y='proportion', hue = 'word_gender', data = agg_df)"
+"#sns.barplot(x='subreddit', y='proportion', hue = 'word_gender', data = agg_df)"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"2. Make your own analysis, with a different set of terms"
+"### EXERCISE 3\n",
+"\n",
+"Make your own analysis, with a different set of terms"
 ]
 },
 {
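The commented-out `sns.barplot` call expects long-format data with `subreddit`, `proportion`, and `word_gender` columns. One possible way to reshape a wide `agg` table into that shape is `DataFrame.melt` (a sketch with made-up numbers; the notebook's actual `agg` comes from the aggregation cell above):

```python
import pandas as pd

# Illustrative wide table shaped like the notebook's `agg`
# (the proportion values here are invented).
agg = pd.DataFrame({
    'subreddit': ['news', 'science'],
    'proportionMale': [0.020, 0.010],
    'proportionFemale': [0.012, 0.015],
})

# melt() turns the two proportion columns into long format:
# one row per (subreddit, word_gender) pair, which is what
# sns.barplot(x=..., y=..., hue=...) expects.
agg_df = agg.melt(id_vars='subreddit',
                  value_vars=['proportionMale', 'proportionFemale'],
                  var_name='word_gender', value_name='proportion')
print(agg_df)
```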
@@ -265,6 +271,27 @@
 "There are a number of NLP / text analysis libraries in Python. The one I'm most familiar with is scikit-learn, which is a machine learning library. NLTK, SpaCy, and textblob are some of the most popular. Here is how to run TF-IDF in scikit-learn."
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"## First, we prepare the data for the TF-IDF tool.\n",
+"# We want each subreddit to be represented by a single string.\n",
+"# So, we take our grouped_text (which is a list of lists of words)\n",
+"# and change it into a list of three really long strings, where each\n",
+"# string is all the words that appeared for that subreddit.\n",
+"\n",
+"# This is called a 'list comprehension'\n",
+"as_text = [' '.join(x) for x in grouped_text]\n",
+"\n",
+"# It is equivalent to the following for loop\n",
+"as_text = []\n",
+"for x in grouped_text:\n",
+"    as_text.append(' '.join(x))"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
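The new cell claims the list comprehension and the for loop are interchangeable; that is easy to check with a tiny stand-in for `grouped_text`. Note that the loop must join with `' '` (a space) to match the comprehension:

```python
# Tiny stand-in for the notebook's grouped_text: a list of lists of words.
grouped_text = [['he', 'said'], ['she', 'wrote'], ['they', 'spoke']]

# List comprehension: join each word list into one space-separated string.
as_text = [' '.join(x) for x in grouped_text]

# The equivalent for loop builds the same list step by step.
as_text_loop = []
for x in grouped_text:
    as_text_loop.append(' '.join(x))

print(as_text)
```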
@@ -276,8 +303,6 @@
 "# Just gets the 5000 most common words\n",
 "vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')\n",
 "\n",
-"as_text = [' '.join(x) for x in grouped_text]\n",
-"\n",
 "tfidf_result = vectorizer.fit_transform(as_text)\n",
 "feature_names = vectorizer.get_feature_names()\n",
 "dense = tfidf_result.todense()\n",
@@ -525,7 +550,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### EXERCISE\n",
+"### EXERCISE 4\n",
 "\n",
 "Where topic modeling really shines is in analyzing longer texts - for example, the subreddit [changemyview](https://www.reddit.com/r/changemyview/) has fairly long posts where people explain a controversial view that they hold.\n",
 "\n",

0 commit comments
