|
142 | 142 | "cell_type": "markdown", |
143 | 143 | "metadata": {}, |
144 | 144 | "source": [ |
145 | | - "### EXERCISE\n", |
| 145 | + "### EXERCISE 1\n", |
146 | 146 | "\n", |
147 | 147 | "Modify the code above to plot how often \"Coronavirus\" is used in each of the three subreddits over time" |
148 | 148 | ] |
|
193 | 193 | "source": [ |
194 | 194 | "male_words = ['he', 'his']\n", |
195 | 195 | "female_words = ['she', 'hers']\n", |
| 196 | + "\n", |
196 | 197 | "# This puts all of the text of each subreddit into lists\n", |
197 | | - "grouped_text = sr.groupby('subreddit').all_text.apply(lambda x: ' '.join(x).split())\n", |
| 198 | + "def string_to_list(x):\n", |
| 199 | + " return ' '.join(x).split()\n", |
| 200 | + "grouped_text = sr.groupby('subreddit').all_text.apply(string_to_list)\n", |
| 201 | + "\n", |
198 | 202 | "# Then, we count how often each type of words appears in each subreddit\n", |
199 | 203 | "agg = grouped_text.aggregate({'proportionMale': lambda x: sum([x.count(y) for y in male_words])/len(x),\n", |
200 | 204 | " 'proportionFemale': lambda x: sum([x.count(y) for y in female_words])/len(x)}\n", |
|
214 | 218 | "cell_type": "markdown", |
215 | 219 | "metadata": {}, |
216 | 220 | "source": [ |
217 | | - "### EXERCISES\n", |
| 221 | + "### EXERCISE 2\n", |
218 | 222 | "\n", |
219 | | - "1. One of the trickiest parts of analysis is getting the data in the form that you want it in order to analyze/visualize it. \n", |
| 223 | + "One of the trickiest parts of analysis is getting the data in the form that you want it in order to analyze/visualize it. \n", |
220 | 224 | "\n", |
221 | 225 | "I think a good visualization for this would be a barplot showing how often male and female word types appear for each subreddit. I'll give you the final call to produce the plot:\n", |
222 | 226 | "\n", |
|
240 | 244 | "### Your code here\n", |
241 | 245 | "\n", |
242 | 246 | "\n", |
243 | | - "sns.barplot(x='subreddit', y='proportion', hue = 'word_gender', data = agg_df)" |
| 247 | + "#sns.barplot(x='subreddit', y='proportion', hue = 'word_gender', data = agg_df)" |
244 | 248 | ] |
245 | 249 | }, |
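For Exercise 2, the wrangling step is getting `agg` into the long format that `sns.barplot` expects. Below is a minimal sketch of one possible reshape, not the notebook's own solution. It assumes `agg` is the wide DataFrame produced by the `aggregate` call above (indexed by subreddit, with `proportionMale` and `proportionFemale` columns); the `melt` column names are chosen to match the commented-out plot call.

```python
import seaborn as sns

# A sketch of one possible reshape, assuming `agg` is the wide DataFrame
# produced by the aggregate call above: indexed by subreddit, with
# 'proportionMale' and 'proportionFemale' columns.
agg_df = agg.reset_index().melt(
    id_vars='subreddit',
    value_vars=['proportionMale', 'proportionFemale'],
    var_name='word_gender',    # matches the hue= argument in the plot call
    value_name='proportion')   # matches the y= argument

sns.barplot(x='subreddit', y='proportion', hue='word_gender', data=agg_df)
```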
246 | 250 | { |
247 | 251 | "cell_type": "markdown", |
248 | 252 | "metadata": {}, |
249 | 253 | "source": [ |
250 | | - "2. Make your own analysis, with a different set of terms" |
| 254 | + "### EXERCISE 3\n", |
| 255 | + "\n", |
| 256 | + "Make your own analysis, with a different set of terms" |
251 | 257 | ] |
252 | 258 | }, |
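For Exercise 3, the same counting pattern works for any pair of word lists. A sketch of one possibility, where the term sets are illustrative placeholders and not from the notebook:

```python
# A sketch reusing the counting pattern from above with different terms.
# These word lists are illustrative placeholders, not from the notebook.
happy_words = ['happy', 'glad', 'great']
sad_words = ['sad', 'angry', 'terrible']

alt_agg = grouped_text.aggregate(
    {'proportionHappy': lambda x: sum([x.count(y) for y in happy_words]) / len(x),
     'proportionSad': lambda x: sum([x.count(y) for y in sad_words]) / len(x)})
```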
253 | 259 | { |
|
265 | 271 | "There are a number of NLP / text analysis libraries in Python. The one I'm most familiar with is scikit-learn, which is a machine learning library. NLTK, SpaCy, and textblob are some of the most popular. Here is how to run TF-IDF in scikit-learn." |
266 | 272 | ] |
267 | 273 | }, |
| 274 | + { |
| 275 | + "cell_type": "code", |
| 276 | + "execution_count": null, |
| 277 | + "metadata": {}, |
| 278 | + "outputs": [], |
| 279 | + "source": [ |
| 280 | + "## First, we prepare the data for the TF-IDF tool.\n", |
| 281 | + "# We want each subreddit to be represented by a list of strings.\n", |
| 282 | + "# So, we take our grouped_text (which is a list of lists of words)\n", |
| 283 | + "# and change it into a list of three really long strings, where each\n", |
| 284 | + "# string is all the words that appeared for that subreddit.\n", |
| 285 | + "\n", |
| 286 | + "# This called a 'list comprehension'\n", |
| 287 | + "as_text = [' '.join(x) for x in grouped_text]\n", |
| 288 | + "\n", |
| 289 | + "# It is equivalent to the following for loop\n", |
| 290 | + "as_text = []\n", |
| 291 | + "for x in grouped_text:\n", |
| 292 | + " as_text.append(''.join(x))" |
| 293 | + ] |
| 294 | + }, |
268 | 295 | { |
269 | 296 | "cell_type": "code", |
270 | 297 | "execution_count": null, |
|
276 | 303 | "# Just gets the 5000 most common words\n", |
277 | 304 | "vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')\n", |
278 | 305 | "\n", |
279 | | - "as_text = [' '.join(x) for x in grouped_text]\n", |
280 | | - "\n", |
281 | 306 | "tfidf_result = vectorizer.fit_transform(as_text)\n", |
282 | 307 | "feature_names = vectorizer.get_feature_names()\n", |
283 | 308 | "dense = tfidf_result.todense()\n", |
|
525 | 550 | "cell_type": "markdown", |
526 | 551 | "metadata": {}, |
527 | 552 | "source": [ |
528 | | - "### EXERCISE\n", |
| 553 | + "### EXERCISE 4\n", |
529 | 554 | "\n", |
530 | 555 | "Where topic modeling really shines is in analyzing longer texts - for example, the subreddit [changemyview](https://www.reddit.com/r/changemyview/) has fairly long posts where people explain a controversial view that they hold.\n", |
531 | 556 | "\n", |
|