|
72 | 72 | "\n",
|
73 | 73 | "Below is a diagram of the Law of Large numbers in action for three different sequences of Poisson random variables. \n",
|
74 | 74 | "\n",
|
75 |
| - " We sample `sample_size= 100000` poisson random variables with parameter $\\lambda = 4.5$. (Recall the expected value of a Poisson random variable is equal to it's parameter.) We calculate the average for the first $n$ samples, for $n=1$ to `sample_size`. " |
| 75 | + " We sample `sample_size= 100000` Poisson random variables with parameter $\\lambda = 4.5$. (Recall the expected value of a Poisson random variable is equal to it's parameter.) We calculate the average for the first $n$ samples, for $n=1$ to `sample_size`. " |
76 | 76 | ]
|
77 | 77 | },
|
78 | 78 | {
|
|
224 | 224 | "\n",
|
225 | 225 | "$$ \\frac{1}{N} \\sum_{i=1}^N \\mathbb{1}_A(X_i) \\rightarrow E[\\mathbb{1}_A(X)] = P(A) $$\n",
|
226 | 226 | "\n",
|
227 |
| - "Again, this is fairly obvious after a moments thought: the indicator function is only 1 if the event occurs, so we are summing only the times the event occurs and dividing by the total number of trials (consider how we usually approximate probablities using frequencies). For example, suppose we wish to estimate the probability that a $Z \\sim Exp(.5)$ is greater than 10, and we have many samples from a $Exp(.5)$ distribution. \n", |
| 227 | + "Again, this is fairly obvious after a moments thought: the indicator function is only 1 if the event occurs, so we are summing only the times the event occurs and dividing by the total number of trials (consider how we usually approximate probabilities using frequencies). For example, suppose we wish to estimate the probability that a $Z \\sim Exp(.5)$ is greater than 10, and we have many samples from a $Exp(.5)$ distribution. \n", |
228 | 228 | "\n",
|
229 | 229 | "\n",
|
230 | 230 | "$$ P( Z > 10 ) = \\sum_{i=1}^N \\mathbb{1}_{z > 10 }(Z_i) $$\n"
|
|
259 | 259 | "### What does this all have to do with Bayesian statistics? \n",
|
260 | 260 | "\n",
|
261 | 261 | "\n",
|
262 |
| - "*Point estimates*, to be introduced in the next chapter, in Bayesian inference are computed using expected values. In more analytical Bayesian inference, we would have been required to evaluate complicated expected values represented as multi-dimensional integrals. No longer. If we can sample from the posterior distibution directly, we simply need to evaluate averages. Much easier. If accuracy is a priority, plots like the ones above show how fast you are converging. And if further accuracy is desired, just take more samples from the posterior. \n", |
| 262 | + "*Point estimates*, to be introduced in the next chapter, in Bayesian inference are computed using expected values. In more analytical Bayesian inference, we would have been required to evaluate complicated expected values represented as multi-dimensional integrals. No longer. If we can sample from the posterior distribution directly, we simply need to evaluate averages. Much easier. If accuracy is a priority, plots like the ones above show how fast you are converging. And if further accuracy is desired, just take more samples from the posterior. \n", |
263 | 263 | "\n",
|
264 |
| - "When is enough enough? When can you stop drawing samples from the posterior? That is the practioners decision, and also dependent on the variance of the samples (recall from above a high variance means the average will converge slower). \n", |
| 264 | + "When is enough enough? When can you stop drawing samples from the posterior? That is the practitioners decision, and also dependent on the variance of the samples (recall from above a high variance means the average will converge slower). \n", |
265 | 265 | "\n",
|
266 | 266 | "We also should understand when the Law of Large Numbers fails. As the name implies, and comparing the graphs above for small $N$, the Law is only true for large sample sizes. Without this, the asymptotic result is not reliable. Knowing in what situations the Law fails can give use *confidence in how unconfident we should be*. The next section deals with this issue."
|
267 | 267 | ]
|
|
272 | 272 | "source": [
|
273 | 273 | "## The Disorder of Small Numbers\n",
|
274 | 274 | "\n",
|
275 |
| - "The Law of Large Numbers is only valid as $N$ gets *infinitely* large: never truely attainable. While the law is a powerful tool, it is foolhardy to apply it liberally. Our next example illustrates this.\n", |
| 275 | + "The Law of Large Numbers is only valid as $N$ gets *infinitely* large: never truly attainable. While the law is a powerful tool, it is foolhardy to apply it liberally. Our next example illustrates this.\n", |
276 | 276 | "\n",
|
277 | 277 | "\n",
|
278 | 278 | "##### Example: Aggregated geographic data\n",
|
279 | 279 | "\n",
|
280 | 280 | "\n",
|
281 |
| - "Often data comes in aggregated form. For instance, data may be grouped by state, county, or city level. Of course, the population numbers vary per geographic area. If the data is an average of some characteristic of each the geographic areas, we must be concious of the Law of Large Numbers and how it can *fail* for areas with small populations.\n", |
| 281 | + "Often data comes in aggregated form. For instance, data may be grouped by state, county, or city level. Of course, the population numbers vary per geographic area. If the data is an average of some characteristic of each the geographic areas, we must be conscious of the Law of Large Numbers and how it can *fail* for areas with small populations.\n", |
282 | 282 | "\n",
|
283 |
| - "We will observe this on a toy dataset. Suppose there are five thousand counties in our dataset. Furthermore, population number in each state are uniformly distributed between 100 and 1500. The way the population numbers are generated is irrelevant to the discussion, so we do not justify this. We are interested in measuring the average height of individuals per county. Unbeknowst to the us, height does **not** vary across county, and each individual, regardless of the county he or she is currently living in, has the same distribution of what their height may be:\n", |
| 283 | + "We will observe this on a toy dataset. Suppose there are five thousand counties in our dataset. Furthermore, population number in each state are uniformly distributed between 100 and 1500. The way the population numbers are generated is irrelevant to the discussion, so we do not justify this. We are interested in measuring the average height of individuals per county. Unbeknownst to the us, height does **not** vary across county, and each individual, regardless of the county he or she is currently living in, has the same distribution of what their height may be:\n", |
284 | 284 | "\n",
|
285 | 285 | "$$ \\text{height} \\sim \\text{Normal}(150, 15 ) $$\n",
|
286 | 286 | "\n",
|
|
350 | 350 | "cell_type": "markdown",
|
351 | 351 | "metadata": {},
|
352 | 352 | "source": [
|
353 |
| - "What do we observe? *Without accounting for population sizes* we run the risk of making an enourmous inference error: if we ignored population size, we would say that the county with the shortest and tallest individuals have been correctly circled. But this inference is wrong for the following reason. These two counties do *not* necessarily have the most extreme heights. The error is that the calculated average of the small population is not a good reflection of the true expected value of the population (which should be $\\mu =150$). The sample size/population size/$N$, whatever you wish to call it, is simply too small to invoke the Law of Large Numbers effectively. \n", |
| 353 | + "What do we observe? *Without accounting for population sizes* we run the risk of making an enormous inference error: if we ignored population size, we would say that the county with the shortest and tallest individuals have been correctly circled. But this inference is wrong for the following reason. These two counties do *not* necessarily have the most extreme heights. The error is that the calculated average of the small population is not a good reflection of the true expected value of the population (which should be $\\mu =150$). The sample size/population size/$N$, whatever you wish to call it, is simply too small to invoke the Law of Large Numbers effectively. \n", |
354 | 354 | "\n",
|
355 | 355 | "We provide more damning evidence against this inference. Recall the population numbers were uniformly distributed over 100 to 1500. Our intuition should tell us that the counties with the most extreme population heights should also be uniformly spread over 100 to 4000, and certainly independent of the county's population. Not so. Below are the population sizes of the counties with the most extreme heights."
|
356 | 356 | ]
|
|
390 | 390 | "\n",
|
391 | 391 | "##### Example: Kaggle's *U.S. Census Return Rate Challenge*\n",
|
392 | 392 | "\n",
|
393 |
| - "Below is data from the 2010 US census, which partitions populations beyond counties to the level of block groups (which are aggregates of city blocks or equivilants). The dataset is from a Kaggle machine learning competition some collegues and I participated in. The objective was to predict the census letter mail-back rate of a group block, measured between 0 and 100, using census variables (median income, number of females in the block-group, number of trailer parks, average number of children etc.). Below we plot the census mail-back rate versus block group population:" |
| 393 | + "Below is data from the 2010 US census, which partitions populations beyond counties to the level of block groups (which are aggregates of city blocks or equivalents). The dataset is from a Kaggle machine learning competition some colleagues and I participated in. The objective was to predict the census letter mail-back rate of a group block, measured between 0 and 100, using census variables (median income, number of females in the block-group, number of trailer parks, average number of children etc.). Below we plot the census mail-back rate versus block group population:" |
394 | 394 | ]
|
395 | 395 | },
|
396 | 396 | {
|
|
433 | 433 | "cell_type": "markdown",
|
434 | 434 | "metadata": {},
|
435 | 435 | "source": [
|
436 |
| - "The above is a classic phenonmenon in statistics. I say *classic* referring to the \"shape\" of the scatter plot above. It follows a classic triangular form, that tightens as we increase the sample size (as the Law of Large Numbers becomes more exact). \n", |
| 436 | + "The above is a classic phenomenon in statistics. I say *classic* referring to the \"shape\" of the scatter plot above. It follows a classic triangular form, that tightens as we increase the sample size (as the Law of Large Numbers becomes more exact). \n", |
437 | 437 | "\n",
|
438 |
| - "I am perhaps overstressing the point and maybe I should have titled the book *\"You don't have big data problems!\"*, but here again is an example of the trouble with *small datasets*, not big ones. Simply, small datasets cannot be processed using the Law of Large Numbers. Compare with applying the Law without hassle to big datasets (ex. big data). I mentioned earlier that paradoxically big data prediction problems are solved by relatively simple algorithms. The paradox is partially resolved by understanding that the Law of Large Numbers creates solutions that are *stable*, i.e. adding or substracting a few data points will not affect the solution much. On the other hand, adding or removing data points to a small dataset can create very different results. \n", |
| 438 | + "I am perhaps overstressing the point and maybe I should have titled the book *\"You don't have big data problems!\"*, but here again is an example of the trouble with *small datasets*, not big ones. Simply, small datasets cannot be processed using the Law of Large Numbers. Compare with applying the Law without hassle to big datasets (ex. big data). I mentioned earlier that paradoxically big data prediction problems are solved by relatively simple algorithms. The paradox is partially resolved by understanding that the Law of Large Numbers creates solutions that are *stable*, i.e. adding or subtracting a few data points will not affect the solution much. On the other hand, adding or removing data points to a small dataset can create very different results. \n", |
439 | 439 | "\n",
|
440 | 440 | "For further reading on the hidden dangers of the Law of Large Numbers, I would highly recommend the excellent manuscript [The Most Dangerous Equation](http://nsm.uh.edu/~dgraur/niv/TheMostDangerousEquation.pdf). "
|
441 | 441 | ]
|
|
446 | 446 | "source": [
|
447 | 447 | "##### Example: How Reddits ranks comments\n",
|
448 | 448 | "\n",
|
449 |
| - "You may have disagreed with the original statement that the Law of Large numbers is known to everyone, but only implicity in our subconcious decision making. Consider ratings on online products: how often do you trust an average 5-star rating if there is only 1 reviewer? 2 reviewers? 3 reviewers? We implicitly understand that with such few reviewers that the average rating is **not** a good reflection of the true value of the product.\n", |
| 449 | + "You may have disagreed with the original statement that the Law of Large numbers is known to everyone, but only implicitly in our subconscious decision making. Consider ratings on online products: how often do you trust an average 5-star rating if there is only 1 reviewer? 2 reviewers? 3 reviewers? We implicitly understand that with such few reviewers that the average rating is **not** a good reflection of the true value of the product.\n", |
450 | 450 | "\n",
|
451 | 451 | "This has created flaws in how we sort items, and more generally, how we compare items. Many people have realized that sorting online search results by their rating, whether the objects be books, videos, or online comments, return poor results. Often the seemingly top videos or comments have perfect ratings only from a few enthusiastic fans, and truly more quality videos or comments are hidden in later pages with *falsely-substandard* ratings of around 4.8. How can we correct this?\n",
|
452 | 452 | "\n",
|
|
474 | 474 | "source": [
|
475 | 475 | "One way to determine a prior on the upvote ratio is that look at the historical distribution of upvote ratios. This can be accomplished by scrapping Reddit's comments and determining a distribution. There are a few problems with this technique though:\n",
|
476 | 476 | "\n",
|
477 |
| - "1. Skewed data: The vast majority of comments have very few votes, hence there will be many comments with ratios near the extremes (see the \"triangular plot\" in the above Kaggle dataset), effectivly skewing our distribution to the extremes. One could try to only use comments with votes greater than some threshold. Again, problems are encountered. There is a tradeoff between number of comments available to use and a higher threshold with associated ratio precision. \n", |
478 |
| - "2. Biased data: Reddit is composed of different subpages, called subreddits. Two exampes are *r/aww*, which posts pics of cute animals, and *r/politics*. It is very likely that the user behaviour towards comments of these two subreddits are very different: visitors are likely friend and affectionate in the former, and would therefore upvote comments more, compared to the latter, where comments are likely to be controversial and disagreed upon. Therefore not all comments are the same. \n", |
| 477 | + "1. Skewed data: The vast majority of comments have very few votes, hence there will be many comments with ratios near the extremes (see the \"triangular plot\" in the above Kaggle dataset), effectively skewing our distribution to the extremes. One could try to only use comments with votes greater than some threshold. Again, problems are encountered. There is a tradeoff between number of comments available to use and a higher threshold with associated ratio precision. \n", |
| 478 | + "2. Biased data: Reddit is composed of different subpages, called subreddits. Two examples are *r/aww*, which posts pics of cute animals, and *r/politics*. It is very likely that the user behaviour towards comments of these two subreddits are very different: visitors are likely friend and affectionate in the former, and would therefore upvote comments more, compared to the latter, where comments are likely to be controversial and disagreed upon. Therefore not all comments are the same. \n", |
479 | 479 | "\n",
|
480 | 480 | "\n",
|
481 | 481 | "In light of these, I think it is better to use a `Uniform` prior.\n",
|
|
577 | 577 | "cell_type": "markdown",
|
578 | 578 | "metadata": {},
|
579 | 579 | "source": [
|
580 |
| - " For a given true upvote ratio $p$ and $N$ votes, the number of upvotes will look like a Binomial random variable with parameters $p$ and $N$. (This is because of the equiviliance between upvote ratio and probability of upvoting versus downvoting, out of $N$ possible votes/trials). We create a function that performs Bayesian inference on $p$, for a particular comment's upvote/downvote pair." |
| 580 | + " For a given true upvote ratio $p$ and $N$ votes, the number of upvotes will look like a Binomial random variable with parameters $p$ and $N$. (This is because of the equivalence between upvote ratio and probability of upvoting versus downvoting, out of $N$ possible votes/trials). We create a function that performs Bayesian inference on $p$, for a particular comment's upvote/downvote pair." |
581 | 581 | ]
|
582 | 582 | },
|
583 | 583 | {
|
|