
Commit fc6a48d

Merge pull request CamDavidsonPilon#58 from pablooliveira/Chapter3-fixtypos
Fix typos in Chapter 3
2 parents 79c5274 + e5a13aa commit fc6a48d

File tree

1 file changed: +13 -13 lines changed


Chapter3_MCMC/IntroMCMC.ipynb

Lines changed: 13 additions & 13 deletions
@@ -23,7 +23,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The previous two chapters hid the inner-mechanics of PyMC, and more generally Monte Carlo Markov Chains (MCMC), from the reader. The reason for including this chapter is three-fold. The first is that any book on Bayesian inference must discuss MCMC. I cannot fight this. Blame the statisticians. Secondly, knowing the process of MCMC gives you insight into whether your algorithm has converged. (Converged to what? We will get to that) Thirdly, we'll understand *why* we are returned thousands of samples from the positerior as a solution, which at first thought can be odd. "
+"The previous two chapters hid the inner-mechanics of PyMC, and more generally Monte Carlo Markov Chains (MCMC), from the reader. The reason for including this chapter is three-fold. The first is that any book on Bayesian inference must discuss MCMC. I cannot fight this. Blame the statisticians. Secondly, knowing the process of MCMC gives you insight into whether your algorithm has converged. (Converged to what? We will get to that) Thirdly, we'll understand *why* we are returned thousands of samples from the posterior as a solution, which at first thought can be odd. "
 ]
 },
 {
@@ -32,7 +32,7 @@
 "source": [
 "### The Bayesian landscape\n",
 "\n",
-"When we setup a Bayesian inference problem with $N$ unknowns, we are implicitly creating an $N$ dimensional space for the prior distributions to exist in. Associated with the space is an additional dimension, which we can describe as the *surface*, or *curve*, that sits ontop of the space, that reflects the *prior probability* of a particular point. The surface on the space is defined by our prior distributions. For example, if we have two unknowns $p_1$ and $p_2$, and priors for both are $\\text{Uniform}(0,5)$, the space created is a square of length 5 and the surface is a flat plane that sits ontop of the square (representing that every point is equally likely). "
+"When we setup a Bayesian inference problem with $N$ unknowns, we are implicitly creating an $N$ dimensional space for the prior distributions to exist in. Associated with the space is an additional dimension, which we can describe as the *surface*, or *curve*, that sits on top of the space, that reflects the *prior probability* of a particular point. The surface on the space is defined by our prior distributions. For example, if we have two unknowns $p_1$ and $p_2$, and priors for both are $\\text{Uniform}(0,5)$, the space created is a square of length 5 and the surface is a flat plane that sits on top of the square (representing that every point is equally likely). "
 ]
 },
 {
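For readers reproducing the landscape figures outside the notebook, the flat surface induced by two $\text{Uniform}(0,5)$ priors can be evaluated on a grid with a few lines of NumPy/SciPy. This is a minimal sketch, not the notebook's own plotting code:

    import numpy as np
    from scipy.stats import uniform

    # 100 x 100 grid over the 5-by-5 square where the two uniform priors live.
    x = y = np.linspace(0, 5, 100)
    X, Y = np.meshgrid(x, y)

    # Joint prior density: constant (1/5 * 1/5) everywhere, i.e. a flat plane.
    flat_surface = uniform.pdf(X, loc=0, scale=5) * uniform.pdf(Y, loc=0, scale=5)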
@@ -80,7 +80,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Alternatively, if the two priors are $\\text{Exp}(3)$ and $\\text{Exp}(10)$, then the space is all postive numbers on the 2-D plane, and the surface induced by the priors looks like a water fall that starts at the point (0,0) and flows over the positive numbers. \n",
+"Alternatively, if the two priors are $\\text{Exp}(3)$ and $\\text{Exp}(10)$, then the space is all positive numbers on the 2-D plane, and the surface induced by the priors looks like a water fall that starts at the point (0,0) and flows over the positive numbers. \n",
 "\n",
 "The plots below visualize this. The more dark red the color, the more prior probability is assigned to that location. Conversely, areas with darker blue represent that our priors assign very low probability to that location. "
 ]
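The exponential "waterfall" surface mentioned in this hunk can be sketched the same way; the sketch below assumes $\text{Exp}(\lambda)$ is parameterized by its rate, so `scipy.stats.expon` takes `scale = 1/lambda`:

    import numpy as np
    from scipy.stats import expon

    # Positive quadrant only: exponential priors put no mass on negative values.
    x = y = np.linspace(0.01, 5, 100)
    X, Y = np.meshgrid(x, y)

    # Joint prior density of Exp(3) x Exp(10): largest near (0, 0), decaying outward.
    exp_surface = expon.pdf(X, scale=1.0 / 3) * expon.pdf(Y, scale=1.0 / 10)

Plotting either surface with `plt.contourf(X, Y, ...)` gives the red-to-blue maps the text describes.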
@@ -374,7 +374,7 @@
 "\n",
 "    taus = 1.0/mc.Uniform( \"stds\", 0, 100, size= 2)**2 \n",
 "\n",
-"Notice that we specified `size=2`: we are modelling both $\\tau$s as a single PyMC variable. Note that is does not induce a necessary relationship between the two $\\tau$s, it is simply for succinctness.\n",
+"Notice that we specified `size=2`: we are modeling both $\\tau$s as a single PyMC variable. Note that is does not induce a necessary relationship between the two $\\tau$s, it is simply for succinctness.\n",
 "\n",
 "We also need to specify priors on the centers of the clusters. The centers are really the $\\mu$ parameters in this Normal distributions. Their priors can be modeled by a Normal distribution. Looking at the data, I have an idea where the two centers might be — I would guess somewhere around 120 and 190 respectively, though I am not very confident in these eyeballed estimates. Hence I will set $\\mu_0 = 120, \\mu_1 = 190$ and $\\sigma_{0,1} = 10$ (recall we enter the $\\tau$ parameter, so enter $1/\\sigma^2 = 0.01$ in the PyMC variable.)"
 ]
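The `taus` line quoted in this hunk relies on PyMC 2's operator overloading: dividing and squaring a `Uniform` stochastic yields a deterministic node automatically. A sketch of the same construction, assuming `import pymc as mc`:

    import pymc as mc

    # Uniform prior on each cluster's standard deviation (size=2: one per cluster).
    stds = mc.Uniform("stds", 0, 100, size=2)

    # Precisions tau = 1/sigma^2; the arithmetic produces a Deterministic node in PyMC 2.
    taus = 1.0 / stds ** 2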
@@ -388,7 +388,7 @@
 "centers = mc.Normal( \"centers\", [120, 190], [0.01, 0.01], size =2 )\n",
 "\n",
 "\"\"\"\n",
-"The below determinsitic functions map a assingment, in this case 0 or 1,\n",
+"The below deterministic functions map a assignment, in this case 0 or 1,\n",
 "to a set of parameters, located in the (1,2) arrays `taus` and `centers.`\n",
 "\"\"\"\n",
 "\n",
@@ -834,9 +834,9 @@
 "#### Returning to Clustering: Prediction\n",
 "The above clustering can be generalized to $k$ clusters. Choosing $k=2$ allowed us to visualize the MCMC better, and examine some very interesting plots. \n",
 "\n",
-"What about prediction? Suppose we observe a new data point, say $x = 175$, and we wish to label it to a cluster. It is foolish to simply assign it to the *closer* cluster center, as this ignores the standard deviation of the clusters, and we have seen from the plots aboves that this consideration is very important. More formally: we are interested in the *probability* (as we cannot be certain about labels) of assigning $x=175$ to cluster 1. Denote the assignment of $x$ as $L_x$, which is equal to 0 or 1, and we are interested in $P(L_x = 1 \\;|\\; x = 175 )$. \n",
+"What about prediction? Suppose we observe a new data point, say $x = 175$, and we wish to label it to a cluster. It is foolish to simply assign it to the *closer* cluster center, as this ignores the standard deviation of the clusters, and we have seen from the plots above that this consideration is very important. More formally: we are interested in the *probability* (as we cannot be certain about labels) of assigning $x=175$ to cluster 1. Denote the assignment of $x$ as $L_x$, which is equal to 0 or 1, and we are interested in $P(L_x = 1 \\;|\\; x = 175 )$. \n",
 "\n",
-"A naive method to compute this is to re-reun the above MCMC with the additional data point appended. The disadvantage with this method is that it will be slow to infer for each novel data point. Alternatively, we can try a *less precise*, but much quicker method. \n",
+"A naive method to compute this is to re-run the above MCMC with the additional data point appended. The disadvantage with this method is that it will be slow to infer for each novel data point. Alternatively, we can try a *less precise*, but much quicker method. \n",
 "\n",
 "We will use Bayes Theorem for this. If you recall, Bayes Theorem looks like:\n",
 "\n",
@@ -909,15 +909,15 @@
 "    map_ = mc.MAP( model )\n",
 "    map.fit()\n",
 "\n",
-"The `MAP.fit()` methods has the flexibility of allowing the user to choose which opimization algorithm to use (after all, this is a optimization problem: we are looking for the values that maximize our landscape), as not all optimization algorithms are created equal. The default optimization algorithm in the call to `fit` is scipy's `fmin` algorithm (which attemps to minimize the *negative of the landscape*). An alternative algorithm that is available is Powell's Method, a favourite of PyMC blogger [Abraham Flaxman](http://healthyalgorithms.com/) [1], by calling `fit(method='fmin_powell')`. From my experience, I use the default, but if my convergence is slow or not guaranteed, I experiment with Powell's method. \n",
+"The `MAP.fit()` methods has the flexibility of allowing the user to choose which optimization algorithm to use (after all, this is a optimization problem: we are looking for the values that maximize our landscape), as not all optimization algorithms are created equal. The default optimization algorithm in the call to `fit` is scipy's `fmin` algorithm (which attempts to minimize the *negative of the landscape*). An alternative algorithm that is available is Powell's Method, a favourite of PyMC blogger [Abraham Flaxman](http://healthyalgorithms.com/) [1], by calling `fit(method='fmin_powell')`. From my experience, I use the default, but if my convergence is slow or not guaranteed, I experiment with Powell's method. \n",
 "\n",
 "The MAP can also be used as a solution to the inference problem, as mathematically it is the *most likely* value for the unknowns. But as mentioned earlier in this chapter, this location ignores the uncertainty and doesn't return a distribution.\n",
 "\n",
 "Typically, it is always a good idea, and rarely a bad idea, to prepend your call to `mcmc` with a call to `MAP(model).fit()`. The intermediate call to `fit` is hardly computationally intensive, and will save you time later due to a shorter burn-in period. \n",
 "\n",
 "#### Speaking of the burn-in period\n",
 "\n",
-"Tt is still a good idea to provide a burn-in period, even if we are using `MAP` prior to calling `MCMC.sample`, just to be safe. We can have PyMC automatically discard the first $n$ samples by specifying the `burn` parameter in the call to `sample`. As one does not know when the chain has fully converged, I like to assign the first *half* of my samples to be discarded, sometimes up to 90% of my samples for longer runs. To continue the clustering example from above, my new code would look something like:\n",
+"It is still a good idea to provide a burn-in period, even if we are using `MAP` prior to calling `MCMC.sample`, just to be safe. We can have PyMC automatically discard the first $n$ samples by specifying the `burn` parameter in the call to `sample`. As one does not know when the chain has fully converged, I like to assign the first *half* of my samples to be discarded, sometimes up to 90% of my samples for longer runs. To continue the clustering example from above, my new code would look something like:\n",
 "\n",
 "    model = mc.Model( [p, assignment, taus, centers ] )\n",
 "\n",
@@ -1079,7 +1079,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"With more thinning, the autocorrelation drops quicker. There is a tradeoff though: higher thinning requires more MCMC iterations to achieve the same number of retured samples. For example, 10 000 samples unthinned is 100 000 with a thinning of 10 (though the latter has less autocorrelation). \n",
+"With more thinning, the autocorrelation drops quicker. There is a tradeoff though: higher thinning requires more MCMC iterations to achieve the same number of returned samples. For example, 10 000 samples unthinned is 100 000 with a thinning of 10 (though the latter has less autocorrelation). \n",
 "\n",
 "What is a good amount of thinning. The returned samples will always exhibit some autocorrelation, regardless of how much thinning is done. So long as the autocorrelation tends to zero, you are probably ok. Typically thinning of more than 10 is not necessary.\n",
 "\n",
@@ -1174,13 +1174,13 @@
 "\n",
 "### Intelligent starting values\n",
 "\n",
-"It would be great to start the MCMC algorithm off near the posterior distribution, so that it will take little time to start sampling correctly. We can aid the algorithm by telling where we *think* the posterior distribution will be by specifying the `value` parameter in the `Stochastic` variable creation. Often we posses guess about this anyways. For example, if we have data from a Normal distribution, and we wish to estimate the $\\mu$ paramter, then a good starting value would the *mean* of the data. \n",
+"It would be great to start the MCMC algorithm off near the posterior distribution, so that it will take little time to start sampling correctly. We can aid the algorithm by telling where we *think* the posterior distribution will be by specifying the `value` parameter in the `Stochastic` variable creation. Often we posses guess about this anyways. For example, if we have data from a Normal distribution, and we wish to estimate the $\\mu$ parameter, then a good starting value would the *mean* of the data. \n",
 "\n",
 "    mu = mc.Uniform( \"mu\", 0, 100, value = data.mean() )\n",
 "\n",
-"For most parameters in models, there is a frequentist esimate of it. These estimates are a good starting value for our MCMC algoithms. Of course, this is not always possible for some variables, but including as many appropriate initial values is always a good idea. Even if your guesses are wrong, the MCMC will still converge to the proper distribution, so there is little to lose.\n",
+"For most parameters in models, there is a frequentist estimate of it. These estimates are a good starting value for our MCMC algorithms. Of course, this is not always possible for some variables, but including as many appropriate initial values is always a good idea. Even if your guesses are wrong, the MCMC will still converge to the proper distribution, so there is little to lose.\n",
 "\n",
-"This is what using `MAP` tries to do, by giving good initial values to the MCMC. So why bother specifing user-defined values? Well, even giving `MAP` good values will help it find the maximum a-posterior. \n",
+"This is what using `MAP` tries to do, by giving good initial values to the MCMC. So why bother specifying user-defined values? Well, even giving `MAP` good values will help it find the maximum a-posterior. \n",
 "\n",
 "Also important, *bad initial values* are a source of major bugs in PyMC and can hurt convergence.\n",
 "\n",
