Chapter2_MorePyMC/MorePyMC.ipynb (+21 −21 lines changed)
@@ -15,7 +15,7 @@
 "======\n",
 "______\n",
 "\n",
-"This chapter introduces more PyMC syntax and design patterns, and ways to think about how to model a system from a Bayesian perspective. It also contains tips and data visualization techniques for assesing goodness-of-fit for your Bayesian model."
+"This chapter introduces more PyMC syntax and design patterns, and ways to think about how to model a system from a Bayesian perspective. It also contains tips and data visualization techniques for assessing goodness-of-fit for your Bayesian model."
 ]
 },
 {
@@ -188,7 +188,7 @@
 "source": [
 "PyMC is concerned with two types of programming variables: `stochastic` and `deterministic`.\n",
 "\n",
-"* *stochastic variables* are variables that are not deterministic, i.e., even if you knew all the values of the variables' parents (if it even has any parents), it would still be random. Included in this catagory are instances of classes `Poisson`, `DiscreteUniform`, and `Exponential`.\n",
+"* *stochastic variables* are variables that are not deterministic, i.e., even if you knew all the values of the variables' parents (if it even has any parents), it would still be random. Included in this category are instances of classes `Poisson`, `DiscreteUniform`, and `Exponential`.\n",
 "\n",
 "* *deterministic variables* are variables that are not random if the variables' parents were known. This might be confusing at first: a quick mental check is *if I knew all of variable `foo`'s parent variables, I could determine what `foo`'s value is.* \n",
 "\n",
@@ -335,7 +335,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"#### Determinstic variables\n",
+"#### Deterministic variables\n",
 "\n",
 "Since most variables you will be modeling are stochastic, we distinguish deterministic variables with a `pymc.deterministic` wrapper. (If you are unfamiliar with Python wrappers (also called decorators), that's no problem. Just prepend the `pymc.deterministic` decorator before the variable declaration and you're good to go. No need to know more. ) The declaration of a deterministic variable uses a Python function:\n",
 "\n",
@@ -345,7 +345,7 @@
 "\n",
 "For all purposes, we can treat the object `some_deterministic_var` as a variable and not a Python function. \n",
 "\n",
-"Prepending with the wrapper is the easiest way, but not the only way, to create deterministic variables. This is not completely true: elementary operations, like addition, exponentials etc. implicity create determinsitic variables. For example, the following returns a deterministic variable:"
+"Prepending with the wrapper is the easiest way, but not the only way, to create deterministic variables. This is not completely true: elementary operations, like addition, exponentials etc. implicitly create deterministic variables. For example, the following returns a deterministic variable:"
 ]
 },
 {
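
Assuming the `lambda_1` and `lambda_2` variables from the sketch above, the implicit creation this hunk describes can be verified directly:

    lambda_sum = lambda_1 + lambda_2    # no decorator involved
    print(type(lambda_sum))             # pymc.PyMCObjects.Deterministic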
@@ -464,7 +464,7 @@
 "source": [
 "To frame this in the notation of the first chapter, though this is a slight abuse of notation, we have specified $P(A)$. Our next goal is to include data/evidence/observations $X$ into our model. \n",
 "\n",
-"PyMC stochastic variables have a keyword argument `observed` which accepts a boolean (`False` by default). The keyword `observed` has a very simple role: fix the variable's current value, i.e. make `value` immutable. We have to specify an intial `value` in the variable's creation, equal to the observations we wish to include, typically an array (and it should be an Numpy array for speed). For example:"
+"PyMC stochastic variables have a keyword argument `observed` which accepts a boolean (`False` by default). The keyword `observed` has a very simple role: fix the variable's current value, i.e. make `value` immutable. We have to specify an initial `value` in the variable's creation, equal to the observations we wish to include, typically an array (and it should be an Numpy array for speed). For example:"
"PyMC, and other probablistic programming languages, have been designed to tell these data-generation *stories*. More generally, B. Cronin writes [5]:\n",
591
+
"PyMC, and other probabilistic programming languages, have been designed to tell these data-generation *stories*. More generally, B. Cronin writes [5]:\n",
592
592
"\n",
593
593
"> Probabilistic programming will unlock narrative explanations of data, one of the holy grails of business analytics and the unsung hero of scientific persuasion. People think in terms of stories - thus the unreasonable power of the anecdote to drive decision-making, well-founded or not. But existing analytics largely fails to provide this kind of story; instead, numbers seemingly appear out of thin air, with little of the causal context that humans prefer when weighing their options."
594
594
]
@@ -759,7 +759,7 @@
 "source": [
 "## An algorithm for human deceit\n",
 "\n",
-"Likely the most common statistical task is estimating the frequency of events. However, there is a difference between the *observed frequency* and the *true frequency* of an event. The true frequency can be interpreted as the probability of an event occuring. For example, the true frequency of rolling a 1 on a 6-sided die is 0.166. Knowing the frequency of events like baseball home runs, frequency of social attributes, fraction of internet users with cats etc. are common requests we ask of Nature. Unfortunately, in general Nature hides the true frequency from us and we must *infer* it from observed data.\n",
+"Likely the most common statistical task is estimating the frequency of events. However, there is a difference between the *observed frequency* and the *true frequency* of an event. The true frequency can be interpreted as the probability of an event occurring. For example, the true frequency of rolling a 1 on a 6-sided die is 0.166. Knowing the frequency of events like baseball home runs, frequency of social attributes, fraction of internet users with cats etc. are common requests we ask of Nature. Unfortunately, in general Nature hides the true frequency from us and we must *infer* it from observed data.\n",
 "\n",
 "The *observed frequency* is then the frequency we observe: say rolling the die 100 times you may observe 20 rolls of 1. The observed frequency, 0.2, differs from the true frequency, 0.166. We can use Bayesian statistics to infer probable values of the true frequency using an appropriate prior and observed data.\n",
 "\n",
@@ -769,7 +769,7 @@
 "\n",
 "### The Binomial Distribution\n",
 "\n",
-"The binomial distribution is one of the most popular distributions, mostly because of its simplicity and usefulness. Unlike the other distributions we have encountered thus far in the book, the binomial distribution has 2 parameters: $N$, a positive integer representing $N$ trials or number of instances of potential events, and $p$, the probability of an event occuring in a single trial. Like the Poisson distribution, it is a discrete distribution, but unlike the Poisson distribution, it only weighs integers from $0$ to $N$. The mass distribution looks like:\n",
+"The binomial distribution is one of the most popular distributions, mostly because of its simplicity and usefulness. Unlike the other distributions we have encountered thus far in the book, the binomial distribution has 2 parameters: $N$, a positive integer representing $N$ trials or number of instances of potential events, and $p$, the probability of an event occurring in a single trial. Like the Poisson distribution, it is a discrete distribution, but unlike the Poisson distribution, it only weighs integers from $0$ to $N$. The mass distribution looks like:\n",
 "\n",
 "$$P( X = k ) = {{N}\\choose{k}} p^k(1-p)^{N-k}$$\n",
 "\n",
@@ -829,7 +829,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"##### Example: Cheating amoung students\n",
+"##### Example: Cheating among students\n",
 "\n",
 "We will use the binomial distribution to determine the frequency of students cheating during an exam. If we let $N$ be the total number of students who took the exam, and assuming each student is interviewed post-exam (answering without consequence), we will receive integer $X$ \"Yes I did cheat\" answers. We then find the posterior distribution of $p$, given $N$, some specified prior on $p$, and observed data $X$. \n",
 "\n",
@@ -844,7 +844,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Suppose 100 students are being surveyed for cheating, and we wish to find $p$, the proportion of cheaters. There a few ways we can model this in PyMC. I'll demonstrate the most explict way, and later show a simplified version. Both versions arrive at the same inference. In our data-generation model, we sample $p$, the true proportion of cheaters, from a prior. Since we are quite ignorant about $p$, we will assign it a $\\text{Uniform}(0,1)$ prior."
+"Suppose 100 students are being surveyed for cheating, and we wish to find $p$, the proportion of cheaters. There a few ways we can model this in PyMC. I'll demonstrate the most explicit way, and later show a simplified version. Both versions arrive at the same inference. In our data-generation model, we sample $p$, the true proportion of cheaters, from a prior. Since we are quite ignorant about $p$, we will assign it a $\\text{Uniform}(0,1)$ prior."
 ]
 },
 {
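
In PyMC 2 syntax, the prior just described is a single line (the variable name follows the notebook's cheating example):

    import pymc as mc

    N = 100
    p = mc.Uniform("freq_cheating", 0, 1)   # ignorance prior on the cheating frequency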
@@ -1115,9 +1115,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"I could have typed `p_skewed = 0.5*p + 0.25` instead for a one-liner, as the elementary operations of addition and scalar multiplication will implicity create a `deterministic` variable, but I wanted to make the deterministic boilerplate explicit for clarity's sake. \n",
+"I could have typed `p_skewed = 0.5*p + 0.25` instead for a one-liner, as the elementary operations of addition and scalar multiplication will implicitly create a `deterministic` variable, but I wanted to make the deterministic boilerplate explicit for clarity's sake. \n",
 "\n",
-"If we know the probability of respondents saying \"Yes\", which is `p_skewed`, and we have $N=100$ students, the number of \"Yes\" responses is a binomial random variable with paramters `N` and `p_skewed`.\n",
+"If we know the probability of respondents saying \"Yes\", which is `p_skewed`, and we have $N=100$ students, the number of \"Yes\" responses is a binomial random variable with parameters `N` and `p_skewed`.\n",
 "\n",
 "This is were we include our observed 35 \"Yes\" responses. In the declaration of the `mc.Binomial`, we include `value = 35` and `observed = True`."
 ]
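
Assuming `p_skewed` has been built from `p` as this hunk describes, the observed Binomial is declared as:

    # 35 "Yes" answers out of N = 100 interviews, held fixed via observed=True
    yes_responses = mc.Binomial("number_cheaters", 100, p_skewed,
                                value=35, observed=True)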
@@ -1269,7 +1269,7 @@
 " alpha = 0.5) \n",
 "plt.yticks([0,1])\n",
 "plt.ylabel(\"Damage Incident?\")\n",
-"plt.xlabel(\"Outside temperature (Farhenhit)\" )\n",
+"plt.xlabel(\"Outside temperature (Fahrenheit)\" )\n",
 "plt.title(\"Defects of the Space Shuttle O-Rings vs temperature\");\n"
 ],
 "language": "python",
@@ -1323,7 +1323,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"It looks clear that *the probability* of damage incidents occuring increases as the outside temperature decreases. We are interested in modeling the probability here because it does not look like there is a strict cutoff point between temperature and a damage incident ocurring. The best we can do is ask \"At temperature $t$, what is the probability of a damage incident?\". The goal of this example is to answer that question.\n",
+"It looks clear that *the probability* of damage incidents occurring increases as the outside temperature decreases. We are interested in modeling the probability here because it does not look like there is a strict cutoff point between temperature and a damage incident occurring. The best we can do is ask \"At temperature $t$, what is the probability of a damage incident?\". The goal of this example is to answer that question.\n",
 "\n",
 "We need a function of temperature, call it $p(t)$, that is bounded between 0 and 1 (so as to model a probability) and changes from 1 to 0 as we increase temperature. There are actually many such functions, but the most popular choice is the *logistic function.*\n",
"where $p(t)$ is our logistic function and $t_i$ are the temperatures we have observations about. Notice in the above code we had to set the values of `beta` and `alpha` to 0. The reason for this is that if `beta` and `alpha` are very large, they make `p` equal to 1 or 0. Unfortunately, `mc.Bernoulli` does not like probabilities of exactly 0 or 1, though they are mathematically well-defined probabilties. So by setting the coefficient values to `0`, we set the variable `p` to be a reasonable starting value. This has no effect on our results, nor does it mean we are including any additional information in our prior. It is simply a computational caveat in PyMC. "
1509
+
"where $p(t)$ is our logistic function and $t_i$ are the temperatures we have observations about. Notice in the above code we had to set the values of `beta` and `alpha` to 0. The reason for this is that if `beta` and `alpha` are very large, they make `p` equal to 1 or 0. Unfortunately, `mc.Bernoulli` does not like probabilities of exactly 0 or 1, though they are mathematically well-defined probabilities. So by setting the coefficient values to `0`, we set the variable `p` to be a reasonable starting value. This has no effect on our results, nor does it mean we are including any additional information in our prior. It is simply a computational caveat in PyMC. "
1510
1510
]
1511
1511
},
1512
1512
{
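
Concretely, coefficient declarations consistent with this hunk look like the following sketch; `value=0` only sets the sampler's starting point, not the prior itself:

    # diffuse Normal priors (precision tau = 0.001) with a safe starting value
    beta = mc.Normal("beta", 0, 0.001, value=0)
    alpha = mc.Normal("alpha", 0, 0.001, value=0)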
@@ -1716,7 +1716,7 @@
 "source": [
 "### What about the day of the Challenger disaster?\n",
 "\n",
-"On the day of the Challenger disaster, the outside temperature was 31 degrees fahrenheit. What is the posterior distribution of a defect occuring, given this temperature? The distribution is plotted below. It looks almost guaranteed that the Challenger was going to be subject to defective O-rings."
+"On the day of the Challenger disaster, the outside temperature was 31 degrees Fahrenheit. What is the posterior distribution of a defect occurring, given this temperature? The distribution is plotted below. It looks almost guaranteed that the Challenger was going to be subject to defective O-rings."
 ]
 },
 {
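
A sketch of how that posterior is formed, assuming the `logistic` function from above and posterior trace arrays `beta_samples` and `alpha_samples` from the fitted model:

    import matplotlib.pyplot as plt

    # one defect probability per posterior draw of (beta, alpha)
    prob_31 = logistic(31, beta_samples, alpha_samples)
    plt.hist(prob_31, bins=30, density=True)
    plt.xlim(0, 1)
    plt.title("Posterior probability of defect at 31 degrees F");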
@@ -1748,11 +1748,11 @@
 "source": [
 "### Is our model appropriate?\n",
 "\n",
-"The skeptical reader will say \"You deliberately chose the logistic function for $p(t)$ and the specific priors. Perhaps other functions or priors will give different results. How do I know I have chosen a good model?\" This is absolutely true. To consider an extreme situation, what if I had chosen the function $p(t) = 1,\\; \\forall t$, which guarantees a defect always occurring: I would have again predicted disaster on January 28th. Yet this is clearly a poorly chosen model. On the other hand, if I did choose the logistic function for $p(t)$, but specificed all my priors to be very tight around 0, likely we would have very different posterior distributions. How do we know our model is an expression of the data? This encourages us to measure the model's **goodness of fit**.\n",
+"The skeptical reader will say \"You deliberately chose the logistic function for $p(t)$ and the specific priors. Perhaps other functions or priors will give different results. How do I know I have chosen a good model?\" This is absolutely true. To consider an extreme situation, what if I had chosen the function $p(t) = 1,\\; \\forall t$, which guarantees a defect always occurring: I would have again predicted disaster on January 28th. Yet this is clearly a poorly chosen model. On the other hand, if I did choose the logistic function for $p(t)$, but specified all my priors to be very tight around 0, likely we would have very different posterior distributions. How do we know our model is an expression of the data? This encourages us to measure the model's **goodness of fit**.\n",
 "\n",
-"We can think: *how can we test whether our model is a bad fit?* An idea is to compare observed data (which if we recall is a *fixed* stochastic variable) with artifical dataset which we can simulate. The rational is that if the simulated dataset does not appear similar, statistically, to the observed dataset, then likely our model is not accurately represented the observed data. \n",
+"We can think: *how can we test whether our model is a bad fit?* An idea is to compare observed data (which if we recall is a *fixed* stochastic variable) with artificial dataset which we can simulate. The rational is that if the simulated dataset does not appear similar, statistically, to the observed dataset, then likely our model is not accurately represented the observed data. \n",
 "\n",
-"Previously in this Chapter, we simulated artifical dataset for the SMS example. To do this, we sampled values from the priors. We saw how varied the resulting datasets looked like, and rarely did they mimic our observed dataset. In the current example, we should sample from the *posterior* distributions to create *very plausible datasets*. Luckily, our Bayesian framework makes this very easy. We only need to create a new `Stochastic` variable, that is exactly the same as our variable that stored the observations, but minus the observations themselves. If you recall, our `Stochastic` variable that stored our observed data was:\n",
+"Previously in this Chapter, we simulated artificial dataset for the SMS example. To do this, we sampled values from the priors. We saw how varied the resulting datasets looked like, and rarely did they mimic our observed dataset. In the current example, we should sample from the *posterior* distributions to create *very plausible datasets*. Luckily, our Bayesian framework makes this very easy. We only need to create a new `Stochastic` variable, that is exactly the same as our variable that stored the observations, but minus the observations themselves. If you recall, our `Stochastic` variable that stored our observed data was:\n",
 "\n",
 "    observed = mc.Bernoulli( \"bernoulli_obs\", p, value = D, observed=True)\n",
 "\n",
@@ -2006,7 +2006,7 @@
 "plt.title(\"Temperature-dependent model\")\n",
 "\n",
 "# perfect model\n",
-"# i.e. the probability of defect is equal to if a defect occured or not.\n",
+"# i.e. the probability of defect is equal to if a defect occurred or not.\n",