|
769 | 769 | "\n",
|
770 | 770 | "As this is a hacker book, we'll continue with the web-dev example. For the moment, we will focus on the analysis of site A only. Assume that there is some true probability $0 \\lt p_A \\lt 1$ that a user who is shown site A eventually purchases from the site. This is the true effectiveness of site A. Unfortunately, this quantity is unknown to us. \n",
|
771 | 771 | "\n",
|
772 |
| - "Suppose site A was shown to $N$ people, and $n$ people purchased from the site. The *observed frequency* $\\frac{n}{N}$ does not necessarily equal $p_A$ -- there is randomness in the observed frequency. We are interested in using what we know, $N$ and $n$, to estimate what $p_A$ might be. To make this more concrete, consider an example where $p_A = 0.05$, $N = 1500$ users shown site A, and we will simulate whether the user made a purchase or not. To simulate this from $N$ trials, we will use a *Bernoulli* distribution: if $ X\\ \\sim \\text{Ber}(p)$, then $X$ is 1 with probability $p$ and 0 with probability $1$." |
| 772 | + "Suppose site A was shown to $N$ people, and $n$ people purchased from the site. The *observed frequency* $\\frac{n}{N}$ does not necessarily equal $p_A$ -- there is a difference between the *observed frequency* and the *true frequency* of an event. The true frequency can be interpreted as the probability of an event occurring. For example, the true frequency of rolling a 1 on a 6-sided die is $\\frac{1}{6} \\approx 0.167$. Estimating frequencies such as the fraction of users who make purchases, the frequency of social attributes, or the percent of internet users with cats is a common request we ask of Nature. Unfortunately, in general Nature hides the true frequency from us and we must *infer* it from observed data.\n", |
| 773 | + "\n", |
| 774 | + "The *observed frequency* is then the frequency we actually observe: say, rolling the die 100 times, you may observe 20 rolls of 1. The observed frequency, 0.2, differs from the true frequency, $\\frac{1}{6} \\approx 0.167$. We can use Bayesian statistics to infer probable values of the true frequency using an appropriate prior and observed data.\n", |
| 775 | + "\n", |
| 776 | + "\n", |
| 777 | + "With respect to our A/B example, we are interested in using what we know, $N$ and $n$, to estimate what $p_A$, the true frequency of buyers, might be. To make this more concrete, consider an example where $p_A = 0.05$ and $N = 1500$ users are shown site A, and we will simulate whether each user made a purchase or not. To simulate this from $N$ trials, we will use a *Bernoulli* distribution: if $ X\\ \\sim \\text{Ber}(p)$, then $X$ is 1 with probability $p$ and 0 with probability $1 - p$." |
773 | 778 | ]
|
774 | 779 | },
|
775 | 780 | {
|
|
1109 | 1114 | "source": [
|
1110 | 1115 | "## An algorithm for human deceit\n",
|
1111 | 1116 | "\n",
|
1112 |
| - "Likely the most common statistical task is estimating the frequency of events. However, there is a difference between the *observed frequency* and the *true frequency* of an event. The true frequency can be interpreted as the probability of an event occurring. For example, the true frequency of rolling a 1 on a 6-sided die is 0.166. Knowing the frequency of events like baseball home runs, frequency of social attributes, fraction of internet users with cats etc. are common requests we ask of Nature. Unfortunately, in general Nature hides the true frequency from us and we must *infer* it from observed data.\n", |
1113 |
| - "\n", |
1114 |
| - "The *observed frequency* is then the frequency we observe: say rolling the die 100 times you may observe 20 rolls of 1. The observed frequency, 0.2, differs from the true frequency, 0.166. We can use Bayesian statistics to infer probable values of the true frequency using an appropriate prior and observed data.\n", |
1115 |
| - "\n", |
1116 |
| - "Social data is really interesting as people are not always honest with responses, which adds a further complication into inference. For example, simply asking individuals \"Have you ever cheated on a test?\" will surely contain some rate of dishonesty. What you can say for certain is that the true rate is less than your observed rate (assuming individuals lie *only* about *not cheating*; I cannot imagine one who would admit \"Yes\" to cheating when in fact they hadn't cheated). \n", |
| 1117 | + "Social data has an additional layer of interest, as people are not always honest with their responses, which adds a further complication to inference. For example, simply asking individuals \"Have you ever cheated on a test?\" will surely contain some rate of dishonesty. What you can say for certain is that the true rate is at least your observed rate (assuming individuals lie *only* about *not cheating*; I cannot imagine one who would admit \"Yes\" to cheating when in fact they hadn't cheated). \n", |
1117 | 1118 | "\n",
|
1118 | 1119 | "To present an elegant solution to circumventing this dishonesty problem, and to demonstrate Bayesian modeling, we first need to introduce the binomial distribution.\n",
|
1119 | 1120 | "\n",
|
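The distinction between observed and true frequency that this change moves into the A/B section can be checked with a small simulation. The following is a minimal sketch, not code from the notebook: it assumes NumPy is available, the helper name `simulate_observed_frequency` and the fixed seed are hypothetical, and only the values $p_A = 0.05$ and $N = 1500$ come from the text.

```python
import numpy as np


def simulate_observed_frequency(p_true, n_trials, seed=0):
    """Draw n_trials Bernoulli(p_true) outcomes and return the fraction of 1s.

    Hypothetical helper for illustration; the seed is fixed only so that
    repeated runs give the same draw.
    """
    rng = np.random.default_rng(seed)
    # A Bernoulli(p) variable is a Binomial(1, p) variable:
    # 1 with probability p, 0 with probability 1 - p.
    occurrences = rng.binomial(1, p_true, size=n_trials)
    return occurrences.mean()


p_A = 0.05  # true purchase probability (unknown in practice)
N = 1500    # number of users shown site A

observed = simulate_observed_frequency(p_A, N)
print(f"true frequency:     {p_A:.4f}")
print(f"observed frequency: {observed:.4f}")
```

The observed frequency will usually land near 0.05 without matching it exactly, which is the gap the inference is meant to account for.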
|