Commit 34d9dce

minor spelling/grammar for chapter 6
1 parent 723cd3d commit 34d9dce

File tree

1 file changed: +16 −16 lines changed

Chapter6_Priorities/Priors.ipynb

Lines changed: 16 additions & 16 deletions
@@ -31,13 +31,13 @@
 "\n",
 "Bayesian priors can be classified into two classes: *objective* priors, which aim to allow the data to influence the posterior the most, and *subjective* priors, which allow the practitioner to express his or her views into the prior. \n",
 "\n",
-"What is an example of a objective prior? We have seen some already, including the *flat* prior (which is a uniform distribution over the entire possible range of the unknown). Using a flat prior implies we give each possible value an equal weighting. Choosing this type of prior is invoking what is called \"The Principle of Indifference\", literally we have no a prior reason to favor one value over another. Though similar, but not correct, is calling a flat prior over a restricted space an objective prior. For example, if we know $p$ in a Binomial model is greater than 0.5, then $\\text{Uniform}(0.5,1)$ is not an objective prior (since we have used prior knowledge) even though it is \"flat\" over [0.5, 1]. The flat prior must be flat along the *entire* range of possibilities. \n",
+"What is an example of an objective prior? We have seen some already, including the *flat* prior (which is a uniform distribution over the entire possible range of the unknown). Using a flat prior implies we give each possible value an equal weighting. Choosing this type of prior is invoking what is called \"The Principle of Indifference\", literally we have no prior reason to favor one value over another. Though similar, but not correct, is calling a flat prior over a restricted space an objective prior. For example, if we know $p$ in a Binomial model is greater than 0.5, then $\\text{Uniform}(0.5,1)$ is not an objective prior (since we have used prior knowledge) even though it is \"flat\" over [0.5, 1]. The flat prior must be flat along the *entire* range of possibilities. \n",
 "\n",
 "Aside from the flat prior, other examples of objective priors are less obvious, but they contain important characteristics that reflect objectivity. For now, it should be said that *rarely* is a objective prior *truly* objective. We will see this later. \n",
 "\n",
 "#### Subjective Priors\n",
 "\n",
-"On the other hand, if we added more probability mass to certain areas of the prior, and less elsewhere, we are biasing our inference towards the unknowns existing in the former area. This is known as a subjective, or *informative* prior. In the figure below, the subjective prior reflects a belief that the unknown likely lives around 0.5, and not around the extremes. The objective priors is insensitive to this."
+"On the other hand, if we added more probability mass to certain areas of the prior, and less elsewhere, we are biasing our inference towards the unknowns existing in the former area. This is known as a subjective, or *informative* prior. In the figure below, the subjective prior reflects a belief that the unknown likely lives around 0.5, and not around the extremes. The objective prior is insensitive to this."
 ]
 },
 {
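For context on the passage this hunk edits: a minimal sketch (not from the notebook) of the three priors it contrasts, using scipy.stats. The Beta(10, 10) choice for the subjective prior is an illustrative assumption.

```python
# Sketch: contrast an objective (flat) prior with a subjective one.
# Assumption: Beta(10, 10) stands in for a prior concentrated near 0.5.
import numpy as np
from scipy import stats

p = np.linspace(0, 1, 200)
flat_prior = stats.beta(1, 1).pdf(p)          # Uniform(0, 1): the objective choice
subjective_prior = stats.beta(10, 10).pdf(p)  # mass concentrated around 0.5

# A Uniform(0.5, 1) prior is *not* objective: it encodes the knowledge p > 0.5.
restricted_prior = stats.uniform(0.5, 0.5).pdf(p)
```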
@@ -105,7 +105,7 @@
 "\n",
 "The choice, either *objective* or *subjective* mostly depend on the problem being solved, but there are a few cases where one is preferred over the other. In instances of scientific research, the choice of an objective prior is obvious. This eliminates any biases in the results, and two researchers who might have differing prior opinions would feel an objective prior is fair. Consider a more extreme situation:\n",
 "\n",
-"> A tobacco company publishes an report with a Bayesian methodology that retreated 60 years of medical research on tobacco use. Would you believe the results? Unlikely. The researchers probably choose a subjective prior that too strongly biased results in their favor.\n",
+"> A tobacco company publishes a report with a Bayesian methodology that retreated 60 years of medical research on tobacco use. Would you believe the results? Unlikely. The researchers probably chose a subjective prior that too strongly biased results in their favor.\n",
 "\n",
 "Unfortunately, choosing an objective prior is not as simple as selecting a flat prior, and even today the problem is still not completely solved. The problem with naively choosing the uniform prior is that pathological issues can arise. Some of these issues are pedantic, but we delay more serious issues to the Appendix of this Chapter (TODO)."
 ]
@@ -125,7 +125,7 @@
 "\n",
 "If the posterior does not make sense, then clearly one had an idea what the posterior *should* look like (not what one *hopes* it looks like), implying that the current prior does not contain all the prior information and should be updated. At this point, we can discard the current prior and choose a more reflective one.\n",
 "\n",
-"Gelman [4] suggests that using a uniform distribution, with a large bounds, is often a good choice for objective priors. Although, one should be wary about using Uniform objective priors with large bounds, as they can assign too large of a prior probability to non-intuitive points. Ask: do you really think the unknown could be incredibly large? Often quantities are naturally biased towards 0. A Normal random variable with large variance (small precision) might be a better choice, or an Exponential with a fat tail in the strictly positive (or negative) case. \n",
+"Gelman [4] suggests that using a uniform distribution with large bounds is often a good choice for objective priors. Although, one should be wary about using Uniform objective priors with large bounds, as they can assign too large of a prior probability to non-intuitive points. Ask: do you really think the unknown could be incredibly large? Often quantities are naturally biased towards 0. A Normal random variable with large variance (small precision) might be a better choice, or an Exponential with a fat tail in the strictly positive (or negative) case. \n",
 "\n",
 "If using a particularly subjective prior, it is your responsibility to be able to explain the choice of that prior, else you are no better than the tobacco company's guilty parties. "
 ]
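The alternatives named in this hunk (Normal with large variance, fat-tailed Exponential) are easy to compare side by side; a hedged sketch with illustrative bounds and scales, not taken from the notebook:

```python
# Sketch: three "vague" priors for a loosely-bounded unknown, per the passage.
# The specific bounds/scales below are illustrative assumptions.
import numpy as np
from scipy import stats

x = np.linspace(0, 100, 500)
uniform_prior = stats.uniform(0, 100).pdf(x)      # flat, equal mass even on huge values
normal_prior = stats.norm(0, 30).pdf(x)           # large variance = small precision
exponential_prior = stats.expon(scale=30).pdf(x)  # fat-tailed, strictly positive
```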
@@ -158,7 +158,7 @@
 "\n",
 ">*observed data* $\\Rightarrow$ *prior* $\\Rightarrow$ *observed data* $\\Rightarrow$ *posterior*\n",
 "\n",
-"Ideally, all prior should be specified *before* we observe the data, so that the data does not influence our prior opinions (see the volumes of research by Daniel Kahnem *et. al* about [anchoring](http://en.wikipedia.org/wiki/Anchoring_and_adjustment) )."
+"Ideally, all priors should be specified *before* we observe the data, so that the data does not influence our prior opinions (see the volumes of research by Daniel Kahnem *et. al* about [anchoring](http://en.wikipedia.org/wiki/Anchoring_and_adjustment) )."
 ]
 },
 {
@@ -222,7 +222,7 @@
 "source": [
 "### The Wishart distribution\n",
 "\n",
-"Until now, we have only seen random variables that a scalars. Of course, we can also have *random matrices*! Specifically, the Wishart distribution is a distribution over all [positive semi-definite matrices](http://en.wikipedia.org/wiki/Positive-definite_matrix). Why is this useful to have in our arsenal? (Proper) covariance matrices are positive-definite, hence the Wishart is an appropriate prior for covariance matrices. We can't really visualize a distribution of matrices, so I'll plot some realizations from the $5 \\times 5$ (above) and $20 \\times 20$ (below) Wishart distribution:"
+"Until now, we have only seen random variables that are scalars. Of course, we can also have *random matrices*! Specifically, the Wishart distribution is a distribution over all [positive semi-definite matrices](http://en.wikipedia.org/wiki/Positive-definite_matrix). Why is this useful to have in our arsenal? (Proper) covariance matrices are positive-definite, hence the Wishart is an appropriate prior for covariance matrices. We can't really visualize a distribution of matrices, so I'll plot some realizations from the $5 \\times 5$ (above) and $20 \\times 20$ (below) Wishart distribution:"
 ]
 },
 {
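The plotting cell this hunk refers to is not part of the diff; one minimal way to draw the kind of Wishart realizations it describes (using scipy.stats.wishart rather than the notebook's own code, with assumed df and identity scale) would be:

```python
# Sketch: draw realizations from 5x5 and 20x20 Wishart distributions.
# df and scale are illustrative; the notebook's actual parameters are not shown here.
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(0)
for n in (5, 20):
    # df must be >= dimension; an identity scale keeps the example simple.
    W = wishart(df=n + 1, scale=np.eye(n)).rvs(random_state=rng)
    print(W.shape)  # (n, n): a random positive semi-definite matrix
```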
@@ -336,9 +336,9 @@
 "- Ecology: animals have a finite amount of energy to expend, and following certain behaviours has uncertain rewards. How does the animal maximize its fitness?\n",
 "- Finance: which stock option gives the highest return, under time-varying return profiles.\n",
 "- Clinical trials: a researcher would like to find the best treatment, out of many possible treatment, while minimizing losses. \n",
-"- Psychology: how does punishment and reward effect our behaviour? How do humans' learn?\n",
+"- Psychology: how does punishment and reward affect our behaviour? How do humans learn?\n",
 "\n",
-"Many of these questions above a fundamental to the application's field.\n",
+"Many of these questions above are fundamental to the application's field.\n",
 "\n",
 "It turns out the *optimal solution* is incredibly difficult, and it took decades for an overall solution to develop. There are also many approximately-optimal solutions which are quite good. The one I wish to discuss is one of the few solutions that can scale incredibly well. The solution is known as *Bayesian Bandits*.\n",
 "\n",
@@ -348,7 +348,7 @@
 "\n",
 "Any proposed strategy is called an *online algorithm* (not in the internet sense, but in the continuously-being-updated sense), and more specifically a reinforcement learning algorithm. The algorithm starts in an ignorant state, where it knows nothing, and begins to acquire data by testing the system. As it acquires data and results, it learns what the best and worst behaviours are (in this case, it learns which bandit is the best). With this in mind, perhaps we can add an additional application of the Multi-Armed Bandit problem:\n",
 "\n",
-"- Psychology: how does punishment and reward effect our behaviour? How do humans' learn?\n",
+"- Psychology: how does punishment and reward affect our behaviour? How do humans learn?\n",
 "\n",
 "\n",
 "The Bayesian solution begins by assuming priors on the probability of winning for each bandit. In our vignette we assumed complete ignorance of the these probabilities. So a very natural prior is the flat prior over 0 to 1. The algorithm proceeds as follows:\n",
@@ -664,7 +664,7 @@
 "\\end{align}\n",
 "\n",
 "\n",
-"where $w_{B(i)}$ is the probability of a prize of the chosen bandit in the $i$ round. A total regret of 0 means the strategy is matching the best possible score. This is likely not possible, as initially our algorithm will often make the wrong choice. Ideally, a strategy's total regret should flatten as it learns the best bandit. (Mathematically we achieve $w_{B(i)}=w_{opt}$ often)\n",
+"where $w_{B(i)}$ is the probability of a prize of the chosen bandit in the $i$ round. A total regret of 0 means the strategy is matching the best possible score. This is likely not possible, as initially our algorithm will often make the wrong choice. Ideally, a strategy's total regret should flatten as it learns the best bandit. (Mathematically, we achieve $w_{B(i)}=w_{opt}$ often)\n",
 "\n",
 "\n",
 "Below we plot the total regret of this simulation, including the scores of some other strategies:\n",
@@ -854,7 +854,7 @@
 "\n",
 "Because of the Bayesian Bandits algorithm's simplicity, it is easy to extend. Some possibilities:\n",
 "\n",
-"- If interested in the *minimum* probability (eg: where prizes are are bad thing), simply choose $B = \\text{argmin} \\; X_b$ and proceed.\n",
+"- If interested in the *minimum* probability (eg: where prizes are a bad thing), simply choose $B = \\text{argmin} \\; X_b$ and proceed.\n",
 "\n",
 "- Adding learning rates: Suppose the underlying environment may change over time. Technically the standard Bayesian Bandit algorithm would self-update itself (awesome) by noting that what it thought was the best is starting to fail more often, we can motivate the algorithm to learn changing environments quicker. We simply need to add a *rate* term upon updating:\n",
 "\n",
@@ -872,15 +872,15 @@
 " 1. Sample a random variable $X_b$ from the prior of bandit $b$, for all $b$.\n",
 " 2. Select the bandit with largest sample, i.e. select bandit $B = \\text{argmax}\\;\\; X_b$.\n",
 " 3. Observe the result,$R \\sim f_{y_a}$, of pulling bandit $B$, and update your prior on bandit $B$.\n",
-" 4. Return to A\n",
+" 4. Return to 1\n",
 "\n",
 " The issue is in the sampling of $X_b$ drawing phase. With Beta priors and Bernoulli observations, we have a Beta posterior — this is easy to sample from. But now, with arbitrary distributions $f$, we have a non-trivial posterior. Sampling from these can be difficult.\n",
 "\n",
-"- There has been some interest in extending the Bayesian Bandit algorithm to commenting systems. Recall in Chapter 4, we developed a ranking algorithm based on the Bayesian lower-bound of the proportion of upvotes to total total votes. One problem with this approach is that it will bias the top rankings towards older comments, since older comments naturally have more votes (and hence the lower-bound is tighter to the true proportion). This creates a positive feedback cycle where older comments gain more votes, hence are displayed more often, hence gain more votes, etc. This pushes any new, potentially better comments, towards the bottom. J. Neufeld proposes a system to remedy this that uses a Bayesian Bandit solution.\n",
+"- There has been some interest in extending the Bayesian Bandit algorithm to commenting systems. Recall in Chapter 4, we developed a ranking algorithm based on the Bayesian lower-bound of the proportion of upvotes to total votes. One problem with this approach is that it will bias the top rankings towards older comments, since older comments naturally have more votes (and hence the lower-bound is tighter to the true proportion). This creates a positive feedback cycle where older comments gain more votes, hence are displayed more often, hence gain more votes, etc. This pushes any new, potentially better comments, towards the bottom. J. Neufeld proposes a system to remedy this that uses a Bayesian Bandit solution.\n",
 "\n",
 "His proposal is to consider each comment as a Bandit, with a the number of pulls equal to the number of votes cast, and number of rewards as the number of upvotes, hence creating a $\\text{Beta}(1+U,1+D)$ posterior. As visitors visit the page, samples are drawn from each bandit/comment, but instead of displaying the comment with the $\\max$ sample, the comments are ranked according the the ranking of their respective samples. From J. Neufeld's blog [7]:\n",
 "\n",
-" > [The] resulting ranking algorithm is quite straightforward, each new time the comments page is loaded, the score for each comment is sampled from a $\\text{Beta}(1+U,1+D)$, comments are then ranked by this score in descending order... This randomization has a unique benefit in that even untouched comments $(U=1,D=0)$ have some chance of being seen even in threads with 5000+ comments (something that is not happening now), but, at the same time, the user will is not likely to be inundated with rating these new comments. "
+" > [The] resulting ranking algorithm is quite straightforward, each new time the comments page is loaded, the score for each comment is sampled from a $\\text{Beta}(1+U,1+D)$, comments are then ranked by this score in descending order... This randomization has a unique benefit in that even untouched comments $(U=1,D=0)$ have some chance of being seen even in threads with 5000+ comments (something that is not happening now), but, at the same time, the user is not likely to be inundated with rating these new comments. "
 ]
 },
 {
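The ranking scheme quoted above (a fresh Beta(1+U, 1+D) sample per comment on each page load, sorted descending) fits in a few lines; a sketch with invented vote counts:

```python
# Sketch: rank comments by a fresh Beta(1+U, 1+D) sample on each page load.
# The (upvotes, downvotes) pairs are illustrative.
import numpy as np

rng = np.random.default_rng()
votes = [(120, 40), (5, 1), (0, 0), (300, 290)]  # (U, D) per comment

scores = [rng.beta(1 + u, 1 + d) for u, d in votes]
ranking = np.argsort(scores)[::-1]  # display order, best sampled score first
```

The randomization is the point: a brand-new comment still draws from a wide Beta, so it occasionally samples high and gets shown.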
@@ -1277,7 +1277,7 @@
 "source": [
 "Why did this occur? Recall how I mentioned that finance has a very very low signal to noise ratio. This implies an environment where inference is much more difficult. One should be careful about over interpreting these results: notice (in the first figure) that each distribution is positive at 0, implying that the stock may return nothing. Furthermore, the subjective priors influenced the results. From the fund managers point of view, this is good as it reflects his updated beliefs about the stocks, whereas from a neutral viewpoint this can be too subjective of a result. \n",
 "\n",
-"Below we show the posterior correlation matrix, and posterior standard deviations. An important caveat to know is that the Wishart distribution models the *inverse covariance matrix*, so we must invert it to get the covariance matrix. We also normalize the matrix to acquire the *correlation matrix*. Since we cannot plot hundreds of matrices effectively, we settle my summarizing the posterior distribution of correlation matrices of showing the *mean posterior correlation matrix* (defined on line 2)."
+"Below we show the posterior correlation matrix, and posterior standard deviations. An important caveat to know is that the Wishart distribution models the *inverse covariance matrix*, so we must invert it to get the covariance matrix. We also normalize the matrix to acquire the *correlation matrix*. Since we cannot plot hundreds of matrices effectively, we settle by summarizing the posterior distribution of correlation matrices by showing the *mean posterior correlation matrix* (defined on line 2)."
 ]
 },
 {
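The invert-then-normalize step this hunk describes, applied to a stack of sampled precision (inverse covariance) matrices, might look like the sketch below; `precision_samples` is a stand-in for the notebook's MCMC trace, not its actual variable name:

```python
# Sketch: turn posterior precision-matrix samples into a mean correlation matrix.
# `precision_samples` stands in for the MCMC trace (shape: draws x d x d).
import numpy as np

def mean_posterior_correlation(precision_samples):
    covs = np.linalg.inv(precision_samples)             # invert each precision matrix
    sds = np.sqrt(np.diagonal(covs, axis1=1, axis2=2))  # per-draw standard deviations
    corrs = covs / (sds[:, :, None] * sds[:, None, :])  # normalize to correlations
    return corrs.mean(axis=0)                           # mean posterior correlation matrix
```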
@@ -1368,7 +1368,7 @@
 "\n",
 "$$ \\underbrace{\\text{Beta}}_{\\text{prior}} \\cdot \\overbrace{\\text{Binomial}}^{\\text{data}} = \\overbrace{\\text{Beta}}^{\\text{posterior} } $$ \n",
 "\n",
-"Notice the $\\text{Beta}$ on both sides of this equation (no, you cannot cancel them, this is not a *real* equation). This is a really useful property. It allows us avoid using MCMC, since the posterior is known in closed form. Hence inference and analytics are easy to derive. This shortcut was the heart of the Bayesian Bandit algorithm above. Fortunately, there is an entire family of distributions that have similar behaviour. \n",
+"Notice the $\\text{Beta}$ on both sides of this equation (no, you cannot cancel them, this is not a *real* equation). This is a really useful property. It allows us to avoid using MCMC, since the posterior is known in closed form. Hence inference and analytics are easy to derive. This shortcut was the heart of the Bayesian Bandit algorithm above. Fortunately, there is an entire family of distributions that have similar behaviour. \n",
 "\n",
 "Suppose $X$ comes from, or is believed to come from, a well-known distribution, call it $f_{\\alpha}$, where $\\alpha$ are possibly unknown parameters of $f$. $f$ could be a Normal distribution, or Binomial distribution, etc. For particular distributions $f_{\\alpha}$, there may exist a prior distribution $p_{\\beta}$, such that:\n",
 "\n",
