
Commit 2380dba

edits to mcmc in chapter1
1 parent 9a121bd commit 2380dba

File tree

1 file changed: +20 -19 lines

Chapter1_Introduction/Chapter1_Introduction.ipynb

Lines changed: 20 additions & 19 deletions
Original file line number · Diff line number · Diff line change
@@ -92,30 +92,31 @@
9292
"\n",
9393
"###Bayesian Inference in Practice\n",
9494
"\n",
95-
" If frequentist and Bayesian inference were computer programming functions, with inputs being statistical problems, then the two would be different in what they return to the user. The frequentist inference function would return a number, whereas the Bayesian function would return a *distribution*.\n",
95+
" If frequentist and Bayesian inference were programming functions, with inputs being statistical problems, then the two would be different in what they return to the user. The frequentist inference function would return a number, whereas the Bayesian function would return *probabilties*.\n",
9696
"\n",
97-
"For example, in our debugging problem above, calling the frequentist function with the argument \"My code passed all $X$ tests; is my code bug-free?\" would return a *YES*. On the other hand, asking our Bayesian function \"Often my code has bugs. My code passed all $X$ tests; is my code bug-free?\" would return something very different: a distribution over *YES* and *NO*. The function might return \n",
97+
"For example, in our debugging problem above, calling the frequentist function with the argument \"My code passed all $X$ tests; is my code bug-free?\" would return a *YES*. On the other hand, asking our Bayesian function \"Often my code has bugs. My code passed all $X$ tests; is my code bug-free?\" would return something very different: a probabilities of *YES* and *NO*. The function might return:\n",
9898
"\n",
9999
"\n",
100100
"> *YES*, with probability 0.8; *NO*, with probability 0.2\n",
101101
"\n",
102102
"\n",
103103
"\n",
104-
"This is very different from the answer the frequentist function returned. Notice that the Bayesian function accepted an additional argument: *\"Often my code has bugs\"*. This parameter is the *prior*. By including the prior parameter, we are telling the Bayesian function to include our personal belief about the situation. Technically this parameter in the Bayesian function is optional, but we will see excluding it has its own consequences. \n",
104+
"This is very different from the answer the frequentist function returned. Notice that the Bayesian function accepted an additional argument: *\"Often my code has bugs\"*. This parameter is the *prior*. By including the prior parameter, we are telling the Bayesian function to include our belief about the situation. Technically this parameter in the Bayesian function is optional, but we will see excluding it has its own consequences. \n",
105105
"\n",
106+
"####Incorporating evidence\n",
106107
"\n",
107-
"As we acquire more and more instances of evidence, our prior belief is *washed out* by the new evidence. This is to be expected. For example, if your prior belief is something ridiculous, like \"I expect the sun to explode today\", and each day you are proved wrong, you would hope that any inference would correct you, or at least align your beliefs better. \n",
108+
"As we acquire more and more instances of evidence, our prior belief is *washed out* by the new evidence. This is to be expected. For example, if your prior belief is something ridiculous, like \"I expect the sun to explode today\", and each day you are proved wrong, you would hope that any inference would correct you, or at least align your beliefs better. Bayesian inference will correct this belief.\n",
108109
"\n",
109110
"\n",
110-
"Denote $N$ as the number of instances of evidence we possess. As we gather an *infinite* amount of evidence, say as $N \\rightarrow \\infty$, our Bayesian results align with frequentist results. Hence for large $N$, statistical inference is more or less objective. On the other hand, for small $N$, inference is much more *unstable*: frequentist estimates have more variance and larger confidence intervals. This is where Bayesian analysis excels. By introducing a prior, and returning a distribution (instead of a scalar estimate), we *preserve the uncertainity* to reflect the instability of stasticial inference of a small $N$ dataset. \n",
111+
"Denote $N$ as the number of instances of evidence we possess. As we gather an *infinite* amount of evidence, say as $N \\rightarrow \\infty$, our Bayesian results align with frequentist results. Hence for large $N$, statistical inference is more or less objective. On the other hand, for small $N$, inference is much more *unstable*: frequentist estimates have more variance and larger confidence intervals. This is where Bayesian analysis excels. By introducing a prior, and returning probabilities (instead of a scalar estimate), we *preserve the uncertainity* that reflects the instability of stasticial inference of a small $N$ dataset. \n",
111112
"\n",
112-
"One may think that for large $N$, one can be indifferent between the two techniques, and might lean towards the computational-simpler, frequentist methods. An analyst in this position should consider the following quote by Andrew Gelman (2005)[1], before making such a decision:\n",
113+
"One may think that for large $N$, one can be indifferent between the two techniques since they offer similar inference, and might lean towards the computational-simpler, frequentist methods. An individual in this position should consider the following quote by Andrew Gelman (2005)[1], before making such a decision:\n",
113114
"\n",
114115
"> Sample sizes are never large. If $N$ is too small to get a sufficiently-precise estimate, you need to get more data (or make more assumptions). But once $N$ is \"large enough,\" you can start subdividing the data to learn more (for example, in a public opinion poll, once you have a good estimate for the entire country, you can estimate among men and women, northerners and southerners, different age groups, etc etc). $N$ is never enough because if it were \"enough\" you'd already be on to the next problem for which you need more data.\n",
115116
"\n",
116117
"\n",
117118
"#### A note on *Big Data*\n",
118-
"Paradoxically, big data's predictive analytic problems are actually solved by relatively simple models [2][4]. Thus we can argue that big data's prediction difficulty does not lie in the algorithm used, but instead on the computational difficulties of storage and execution on big data. (One should also consider Gelman's qoute from above and ask \"Do I really have big data?\" )\n",
119+
"Paradoxically, big data's predictive analytic problems are actually solved by relatively simple algorithms [2][4]. Thus we can argue that big data's prediction difficulty does not lie in the algorithm used, but instead on the computational difficulties of storage and execution on big data. (One should also consider Gelman's qoute from above and ask \"Do I really have big data?\" )\n",
119120
"\n",
120121
"The much more difficult analytic problems involve *medium data* and, especially troublesome, *really small data*. Using a similar argument as Gelman's above, if big data problems are *big enough* to be readily solved, then we should be more interested in the *not-quite-big enough* datasets. "
121122
]
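A minimal sketch of the "washed out" prior described above (not part of the notebook's own code cells, which are omitted from this diff). It assumes NumPy and SciPy are available and uses a coin-flip model purely for illustration: a Beta prior over the coin's bias is pulled toward the data as the number of observed flips $N$ grows.

```python
# Sketch: a prior being "washed out" as evidence accumulates.
# Beta prior + Bernoulli observations => Beta posterior (conjugacy).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_p = 0.6                    # the coin's real bias, hidden in practice
prior_a, prior_b = 10.0, 10.0   # an opinionated prior centred on 0.5

for n in [0, 5, 50, 500, 5000]:
    heads = int((rng.random(n) < true_p).sum())
    posterior = stats.beta(prior_a + heads, prior_b + (n - heads))
    print(f"N = {n:4d}: posterior mean = {posterior.mean():.3f}")
```

With $N = 0$ the posterior mean stays at the prior's 0.5; by $N = 5000$ it sits essentially at the data's 0.6, regardless of the prior.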
@@ -126,16 +127,16 @@
126127
"source": [
127128
"### Our Bayesian framework\n",
128129
"\n",
129-
"We are interested in beliefs, which can be interpreted as probabilities by thinking Bayesian. We have a *prior* belief in event $A$: it is what you believe before looking at any evidence, e.g., our prior belief about bugs being in our code before performing tests.\n",
130+
"We are interested in beliefs, which can be interpreted as probabilities by thinking Bayesian. We have a *prior* belief in event $A$, beliefs formed by previous information, e.g., our prior belief about bugs being in our code before performing tests.\n",
130131
"\n",
131-
"Secondly, we observe our evidence. To continue our example, if our code passes $X$ tests, we want to update our belief to incorporate this. We call this new belief the *posterior* probability. Updating our belief is done via the following equation, known as Bayes' Theorem, after Thomas Bayes:\n",
132+
"Secondly, we observe our evidence. To continue our buggy-code example: if our code passes $X$ tests, we want to update our belief to incorporate this. We call this new belief the *posterior* probability. Updating our belief is done via the following equation, known as Bayes' Theorem, after it's discoverer Thomas Bayes:\n",
132133
"\n",
133134
"\\begin{align}\n",
134135
" P( A | X ) = & \\frac{ P(X | A) P(A) } {P(X) } \\\\\\\\[5pt]\n",
135136
"& \\propto P(X | A) P(A)\\;\\; (\\propto \\text{is proportional to } )\n",
136137
"\\end{align}\n",
137138
"\n",
138-
"The above formula is not unique to Bayesian inference: it is a mathematical fact with uses outside Bayesian inference. Bayesian inference merely uses it to connect $P(A)$ with an updated $P(A | X )$."
139+
"The above formula is not unique to Bayesian inference: it is a mathematical fact with uses outside Bayesian inference. Bayesian inference merely uses it to connect prior probabilities $P(A)$ with an updated posterior probabilties $P(A | X )$."
139140
]
140141
},
141142
{
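For a binary event $A$, the denominator $P(X)$ expands as $P(X|A)P(A) + P(X|\sim A)P(\sim A)$, so Bayes' Theorem fits in a one-line function. A small sketch in plain Python (not part of the notebook), evaluated at the values that appear in the debugging example below:

```python
# Sketch: Bayes' theorem for a binary event A, expanding P(X) over A and ~A.
def posterior(prior_A, p_X_given_A, p_X_given_not_A):
    """Return P(A | X) for a two-outcome hypothesis."""
    evidence = p_X_given_A * prior_A + p_X_given_not_A * (1.0 - prior_A)
    return p_X_given_A * prior_A / evidence

# Values used later in the debugging example: P(X|A) = 1, P(X|~A) = 0.5, P(A) = 0.2
print(posterior(0.2, 1.0, 0.5))   # 0.333...
```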
@@ -147,7 +148,7 @@
147148
"\n",
148149
"Let $A$ denote the event that our code has **no bugs** in it. Let $X$ denote the event that the code passes all debugging tests. For now, we will leave the prior probability of no bugs as a variable, i.e. $P(A) = p$. \n",
149150
"\n",
150-
"We are interested in $P(A|X)$, i.e. the probability of no bugs, given our debugging tests $X$. To use the formula above, we need to compute some quantities from the formula above.\n",
151+
"We are interested in $P(A|X)$, i.e. the probability of no bugs, given our debugging tests $X$. To use the formula above, we need to compute some quantities.\n",
151152
"\n",
152153
"What is $P(X | A)$, i.e., the probability that the code passes $X$ tests *given* there are no bugs? Well, it is equal to 1, for a code with no bugs will pass all tests. \n",
153154
"\n",
@@ -169,13 +170,13 @@
169170
"cell_type": "markdown",
170171
"metadata": {},
171172
"source": [
172-
"We have already computed $P(X|A)$ above. On the other hand, $P(X | \\sim A)$ is subjective: our code can pass tests but still have a bug in it, though the probability there is a bug present is less. Note this is dependent on the number of tests performed, the degree of complication in the tests, etc. Let's be conservative and assign $P(X|\\sim A) = 0.5$. Then\n",
173+
"We have already computed $P(X|A)$ above. On the other hand, $P(X | \\sim A)$ is subjective: our code can pass tests but still have a bug in it, though the probability there is a bug present is reduced. Note this is dependent on the number of tests performed, the degree of complication in the tests, etc. Let's be conservative and assign $P(X|\\sim A) = 0.5$. Then\n",
173174
"\n",
174175
"\\begin{align}\n",
175176
"P(A | X) & = \\frac{1\\cdot p}{ 1\\cdot p +0.5 (1-p) } \\\\\\\\\n",
176177
"& = \\frac{ 2 p}{1+p}\n",
177178
"\\end{align}\n",
178-
"This is the posterior probability distribution. What does it look like as a function of our prior, $p \\in [0,1]$? "
179+
"This is the posterior probability. What does it look like as a function of our prior, $p \\in [0,1]$? "
179180
]
180181
},
181182
{
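The plotting cell that answers this question is not included in the diff; a minimal sketch of such a figure, assuming matplotlib, might look like:

```python
# Sketch: the posterior P(A|X) = 2p / (1 + p) as a function of the prior p.
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0, 1, 100)
plt.plot(p, p, "--", label="prior, $P(A) = p$")
plt.plot(p, 2 * p / (1 + p), label="posterior, $P(A|X)$")
plt.xlabel("prior probability $p$ of no bugs")
plt.ylabel("probability")
plt.legend()
plt.show()
```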
@@ -207,11 +208,11 @@
207208
"cell_type": "markdown",
208209
"metadata": {},
209210
"source": [
210-
"We can see the biggest gains if we observe the $X$ tests passed are when the prior probability, $p$, is low. Let's settle on a specific value for the prior. I'm a (I think) strong programmer, so I'm going to give myself a realistic prior of 0.20, that is, there is a 20% chance that I write code bug-free. To be more realistic, this prior should be a function of how complicated and large the code is, but let's pin it at 0.20. Then my updated belief that my code is bug-free is 0.33. \n",
211+
"We can see the biggest gains if we observe the $X$ tests passed are when the prior probability, $p$, is low. Let's settle on a specific value for the prior. I'm a strong programmer (I think), so I'm going to give myself a realistic prior of 0.20, that is, there is a 20% chance that I write code bug-free. To be more realistic, this prior should be a function of how complicated and large the code is, but let's pin it at 0.20. Then my updated belief that my code is bug-free is 0.33. \n",
211212
"\n",
212-
"Recall that the prior is a probability distribution: $p$ is the prior probability that there *are no bugs*, so $1-p$ is the prior probability that there *are bugs*.\n",
213+
"Recall that the prior is a probability: $p$ is the prior probability that there *are no bugs*, so $1-p$ is the prior probability that there *are bugs*.\n",
213214
"\n",
214-
"Similarly, our posterior is also a probability distribution, with $P(A | X)$ the probability there is no bug *given we saw all tests pass*, hence $1-P(A|X)$ is the probability there is a bug *given all tests passed*. What does our posterior probability distribution look like? Below is a graph of both the prior and the posterior distributions. \n"
215+
"Similarly, our posterior is also a probability, with $P(A | X)$ the probability there is no bug *given we saw all tests pass*, hence $1-P(A|X)$ is the probability there is a bug *given all tests passed*. What does our posterior probability look like? Below is a graph of both the prior and the posterior probabilities. \n"
215216
]
216217
},
217218
{
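The bar graph referred to above lives in a code cell outside this diff; a rough sketch of it, assuming matplotlib and the $p = 0.20$ prior chosen in the text:

```python
# Sketch: prior vs. posterior over the two outcomes, for a prior p = 0.2 of no bugs.
import matplotlib.pyplot as plt

p = 0.2
post = 2 * p / (1 + p)              # = 0.333..., from the formula above

plt.bar([0.0, 1.0], [1 - p, p], width=0.25, label="prior")
plt.bar([0.3, 1.3], [1 - post, post], width=0.25, label="posterior")
plt.xticks([0.15, 1.15], ["bugs present", "bug-free"])
plt.ylabel("probability")
plt.legend()
plt.show()
```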
@@ -247,7 +248,7 @@
247248
"source": [
248249
"Notice that after we observed $X$ occur, the probability of bugs being absent increased. By increasing the number of tests, we can approach confidence (probability 1) that there are no bugs present.\n",
249250
"\n",
250-
"This was a very simple example of Bayesian inference and Bayes rule. Unfortunately, the mathematics necessary to perform more complicated Bayesian inference only becomes more difficult, except for artifically constructed cases. We will later see that this type of mathematical anaylsis is actually unnecessary. First we must broaden our modeling tools."
251+
"This was a very simple example of Bayesian inference and Bayes rule. Unfortunately, the mathematics necessary to perform more complicated Bayesian inference only becomes more difficult, except for artifically constructed cases. We will later see that this type of mathematical anaylsis is actually unnecessary. First we must broaden our modeling tools. The next section deals with *probability distributions*. If you are already familiar, feel free to skip (or at least skim), but for the less familiar the next section is essential."
251252
]
252253
},
253254
{
@@ -261,7 +262,7 @@
261262
"\n",
262263
"**Let's quickly recall what a probability distribution is:** Let $Z$ be some random variable. Then associated with $Z$ is a *probability distribution function* that assigns probabilities to the different outcomes $Z$ can take. There are three cases:\n",
263264
"\n",
264-
"- **$Z$ is discrete**: Discrete random variables may only assume values on a specified list. Things like populations, movie ratings, and number of votes are all discrete random variables. It's more clear when we contrast it with...\n",
265+
"- **$Z$ is discrete**: Discrete random variables may only assume values on a specified list. Things like populations, movie ratings, and number of votes are all discrete random variables. Discrete random variables beome more clear when we contrast them with...\n",
265266
"\n",
266267
"- **$Z$ is continuous**: Continuous random variable can take on arbitrarily exact values. For example, temperature, speed, time, color are all modeled as continuous variables because you can constantly make the values more and more precise.\n",
267268
"\n",
@@ -372,7 +373,7 @@
372373
"metadata": {},
373374
"source": [
374375
"\n",
375-
"###But what is $\\lambda \\;\\;$?\n",
376+
"###But what is $\\lambda \\;$?\n",
376377
"\n",
377378
"\n",
378379
"**This question is what motivates statistics**. In the real world, $\\lambda$ is hidden from us. We only see $Z$, and must go backwards to try and determine $\\lambda$. The problem is so difficult because there is not a one-to-one mapping from $Z$ to $\\lambda$. Many different methods have been created to solve the problem of estimating $\\lambda$, but since $\\lambda$ is never actually observed, no one can say for certain which method is better! \n",
