Commit baa6e94
Fixed few typos
1 parent 05efb78 commit baa6e94

7 files changed: +29 −29 lines


Chapter3_MCMC/IntroMCMC.ipynb

Lines changed: 4 additions & 4 deletions
@@ -694,7 +694,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Looking at the above plot, it appears that the most uncertainity is between 150 and 170. The above plot slightly misrepresents things, as the x-axis is not a true scale (it displays the value of the $i$th sorted data point.) A more clear diagram is below, where we have estimated the *frequency* of each data point belonging to the labels 0 and 1. "
+"Looking at the above plot, it appears that the most uncertainty is between 150 and 170. The above plot slightly misrepresents things, as the x-axis is not a true scale (it displays the value of the $i$th sorted data point.) A more clear diagram is below, where we have estimated the *frequency* of each data point belonging to the labels 0 and 1. "
 ]
 },
 {
@@ -911,7 +911,7 @@
 "\n",
 "The `MAP.fit()` methods has the flexibility of allowing the user to choose which opimization algorithm to use (after all, this is a optimization problem: we are looking for the values that maximize our landscape), as not all optimization algorithms are created equal. The default optimization algorithm in the call to `fit` is scipy's `fmin` algorithm (which attemps to minimize the *negative of the landscape*). An alternative algorithm that is available is Powell's Method, a favourite of PyMC blogger [Abraham Flaxman](http://healthyalgorithms.com/) [1], by calling `fit(method='fmin_powell')`. From my experience, I use the default, but if my convergence is slow or not guaranteed, I experiment with Powell's method. \n",
 "\n",
-"The MAP can also be used as a solution to the inference problem, as mathematically it is the *most likely* value for the unknowns. But as mentioned earlier in this chapter, this location ignores the uncertainity and doesn't return a distribution.\n",
+"The MAP can also be used as a solution to the inference problem, as mathematically it is the *most likely* value for the unknowns. But as mentioned earlier in this chapter, this location ignores the uncertainty and doesn't return a distribution.\n",
 "\n",
 "Typically, it is always a good idea, and rarely a bad idea, to prepend your call to `mcmc` with a call to `MAP(model).fit()`. The intermediate call to `fit` is hardly computationally intensive, and will save you time later due to a shorter burn-in period. \n",
 "\n",
@@ -1153,7 +1153,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The largest plot on the right-hand side is the histograms of the samples, plus a few extra features. The thickest vertical line represents the posterior mean, which is a good summary of posterior distribution. The interval between the two dashed vertical lines in each the posterior distributions represent the *95% credible interval*, not to be confused with a *95% confidence interval*. I won't get into the latter, but the former can be interpreted as \"there is a 95% chance the parameter of interested lies in this interval\". (Changing default parameters in the call to `mcplot` provides alternatives to 95%.) When communicating your results to others, it is incredibly important to state this interval. One of our purposes for studying Bayesian methods is to have a clear understanding of our uncertainity in unknowns. Combined with the posterior mean, the 95% credible interval provides a reliable interval to communicate the likely location of the unknown (provided by the mean) *and* the uncertainty (represented by the width of the interval)."
+"The largest plot on the right-hand side is the histograms of the samples, plus a few extra features. The thickest vertical line represents the posterior mean, which is a good summary of posterior distribution. The interval between the two dashed vertical lines in each the posterior distributions represent the *95% credible interval*, not to be confused with a *95% confidence interval*. I won't get into the latter, but the former can be interpreted as \"there is a 95% chance the parameter of interested lies in this interval\". (Changing default parameters in the call to `mcplot` provides alternatives to 95%.) When communicating your results to others, it is incredibly important to state this interval. One of our purposes for studying Bayesian methods is to have a clear understanding of our uncertainty in unknowns. Combined with the posterior mean, the 95% credible interval provides a reliable interval to communicate the likely location of the unknown (provided by the mean) *and* the uncertainty (represented by the width of the interval)."
 ]
 },
 {
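The posterior mean and the 95% credible interval described above can be computed directly from the trace with percentiles. A sketch on synthetic samples (the `trace` here is a stand-in for real MCMC output, not data from the book):

```python
import numpy as np

np.random.seed(4)
# stand-in for mcmc.trace(...)[:] -- 25,000 posterior samples
trace = np.random.normal(loc=60.0, scale=2.0, size=25_000)

posterior_mean = trace.mean()
# central 95% credible interval: the 2.5th and 97.5th percentiles,
# i.e. the two dashed lines mcplot draws
lower, upper = np.percentile(trace, [2.5, 97.5])
print(posterior_mean, (lower, upper))
```

Reporting `(lower, upper)` alongside the mean communicates both the likely location and the uncertainty, as the text recommends.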
@@ -1300,4 +1300,4 @@
13001300
"metadata": {}
13011301
}
13021302
]
1303-
}
1303+
}

Chapter4_TheGreatestTheoremNeverTold/LawOfLargeNumbers.ipynb

Lines changed: 7 additions & 7 deletions
@@ -204,7 +204,7 @@
 "\n",
 "$$ \\frac{ \\sqrt{ \\; Var(Z) \\; } }{\\sqrt{N} }$$\n",
 "\n",
-"This is useful to know: for a given large $N$, we know (on average) how far away we are from the estimate. On the other hand, in a Bayesian setting, this can seem like a useless result: Bayesian analysis is OK with uncertainity so what's the *statistical* point of adding extra precise digits? Though drawing samples can be so computationally cheap that having a *larger* $N$ is fine too. \n",
+"This is useful to know: for a given large $N$, we know (on average) how far away we are from the estimate. On the other hand, in a Bayesian setting, this can seem like a useless result: Bayesian analysis is OK with uncertainty so what's the *statistical* point of adding extra precise digits? Though drawing samples can be so computationally cheap that having a *larger* $N$ is fine too. \n",
 "\n",
 "### How do we compute $Var(Z)$ though?\n",
 "\n",
@@ -475,7 +475,7 @@
 "One way to determine a prior on the upvote ratio is that look at the historical distribution of upvote ratios. This can be accomplished by scrapping Reddit's comments and determining a distribution. There are a few problems with this technique though:\n",
 "\n",
 "1. Skewed data: The vast majority of comments have very few votes, hence there will be many comments with ratios near the extremes (see the \"triangular plot\" in the above Kaggle dataset), effectivly skewing our distribution to the extremes. One could try to only use comments with votes greater than some threshold. Again, problems are encountered. There is a tradeoff between number of comments available to use and a higher threshold with associated ratio precision. \n",
-"2. Biased data: Reddit is composed of different subpages, called subreddits. Two exampes are *r/aww*, which posts pics of cute animals, and *r/polictics*. It is very likely that the user behaviour towards comments of these two subreddits are very different: visitors are likely friend and affectionate in the former, and would therefore upvote comments more, compared to the latter, where comments are likely to be controversial and disagreed upon. Therefore not all comments are the same. \n",
+"2. Biased data: Reddit is composed of different subpages, called subreddits. Two exampes are *r/aww*, which posts pics of cute animals, and *r/politics*. It is very likely that the user behaviour towards comments of these two subreddits are very different: visitors are likely friend and affectionate in the former, and would therefore upvote comments more, compared to the latter, where comments are likely to be controversial and disagreed upon. Therefore not all comments are the same. \n",
 "\n",
 "\n",
 "In light of these, I think it is better to use a `Uniform` prior.\n",
@@ -632,7 +632,7 @@
 " \n",
 "plt.legend(loc=\"upper left\")\n",
 "plt.xlim( 0, 1)\n",
-"plt.title(\"Posterior distrbutions of upvote ratios on different comments\");\n",
+"plt.title(\"Posterior distributions of upvote ratios on different comments\");\n",
 "\n"
 ],
 "language": "python",
@@ -664,11 +664,11 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Some distributions are very tight, others have very long tails (relatively speaking), expressing our uncertainity with what the true upvote ratio might be.\n",
+"Some distributions are very tight, others have very long tails (relatively speaking), expressing our uncertainty with what the true upvote ratio might be.\n",
 "\n",
 "### Sorting!\n",
 "\n",
-"We have been ignoring the goal of this exercise: how do we sort the comments from *best to worst*? Of course, we cannot sort distributions, we must sort scalar numbers. There are many ways to distill a distribution down to a scalar: expressing the distribution through its expected value, or mean, is one way. Choosing the mean bad choice though. This is because the mean does not take into account the uncertainity of distributions.\n",
+"We have been ignoring the goal of this exercise: how do we sort the comments from *best to worst*? Of course, we cannot sort distributions, we must sort scalar numbers. There are many ways to distill a distribution down to a scalar: expressing the distribution through its expected value, or mean, is one way. Choosing the mean bad choice though. This is because the mean does not take into account the uncertainty of distributions.\n",
 "\n",
 "I suggest using the *95% least plausible value*, defined as the value such that there is only a 5% chance the true parameter is lower (think of the lower bound on the 95% credible region). Below are the posterior distributions with the 95% least-plausible value plotted:"
 ]
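The 95% least plausible value defined in this hunk is just the 5th percentile of each posterior. A sketch of computing it and sorting by it, assuming the `Uniform` prior the notebook settles on (so each comment's posterior is a Beta); the vote counts below are invented for illustration:

```python
import numpy as np

np.random.seed(2)
votes = [(10, 2), (100, 20), (3, 0)]  # (upvotes, downvotes), hypothetical comments
lower_limits = []
for ups, downs in votes:
    # Uniform prior + observed votes gives a Beta(1 + ups, 1 + downs) posterior
    posterior = np.random.beta(1 + ups, 1 + downs, size=20_000)
    # the value with only a 5% chance the true ratio lies below it
    lower_limits.append(np.percentile(posterior, 5))

order = np.argsort(-np.array(lower_limits))  # best comment first
print(order, lower_limits)
```

Note how the (3, 0) comment, despite a perfect ratio, sorts last: its wide posterior drags its least plausible value down, which is exactly the uncertainty-awareness the mean lacks.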
@@ -696,7 +696,7 @@
 " \n",
 "plt.legend(loc=\"upper left\")\n",
 "\n",
-"plt.title(\"Posterior distrbutions of upvote ratios on different comments\");\n",
+"plt.title(\"Posterior distributions of upvote ratios on different comments\");\n",
 "order = argsort( -np.array( lower_limits ) )\n",
 "print order, lower_limits\n",
 "\n",
@@ -1171,4 +1171,4 @@
 "metadata": {}
 }
 ]
-}
+}

Chapter5_LossFunctions/LossFunctions.ipynb

Lines changed: 6 additions & 6 deletions
@@ -65,7 +65,7 @@
 "\n",
 "Historically, loss functions have been motivated from 1) mathematical convenience, and 2) they are robust to application, i.e., they are objective measures of loss. The first reason has really held back the full breadth of loss functions. With computers being agnogstic to mathematical convience, we are free to design our own loss functions, which we take full advantage of later in this Chapter.\n",
 "\n",
-"With respect to the second point, the above loss functions are indeed objective, in that they are most often a function of the difference between estimate and true parameter, independent of signage or payoff of choosing that estimate. This last point, its independence of payoff, causes quite pathological results though. Consider our hurricane example above: the statistician equivalently predicted that the probability of the hurricane striking was between 0% to 1%. But if he had ignored being precise and instead focused on outcomes (99% change of no flood, 1% chance of flood), he might have advised differently. \n",
+"With respect to the second point, the above loss functions are indeed objective, in that they are most often a function of the difference between estimate and true parameter, independent of signage or payoff of choosing that estimate. This last point, its independence of payoff, causes quite pathological results though. Consider our hurricane example above: the statistician equivalently predicted that the probability of the hurricane striking was between 0% to 1%. But if he had ignored being precise and instead focused on outcomes (99% chance of no flood, 1% chance of flood), he might have advised differently. \n",
 "\n",
 "By shifting our focus from trying to be incredibly precise about parameter estimation to focusing on the outcomes of our parameter estimation, we can customize our estimates to be optimized for our application. This requires us to design new loss functions that reflect our goals and outcomes. Some examples of more interesting loss functions:\n",
 "\n",
@@ -105,13 +105,13 @@
 "\n",
 "In Bayesian inference, we have a mindset that the unknown parameters are really random variables with prior and posterior distributions. Concerning the posterior distribution, a value drawn from it is a *possible* realization of what the true parameter could be. Given that realization, we can compute a loss associated with an estimate. As we have a whole distribution of what the unknown parameter could be (the posterior), we should be more interested in computing the *expected loss* given an estimate. This expected loss is a better estimate of the true loss than comparing the given loss from only a single sample from the posterior.\n",
 "\n",
-"First it will be useful to explain a *Bayesian point esimate*. The systems and machinery present in the modern world are not built to accept posterior distributions as input. It is also rude to hand someone over a distribution when all the asked for was an estimate. In the course of an individual's day, when faced with uncertainty we still act by distilling our uncertainity down to a single action. Similarly, we need to distill our posterior distribution down to a single value (or vector in the multivariate case). If the value is chosen intelligently, we can avoid the flaw of frequentist methodologies that mask the uncertainity and provide a more informative result.The value chosen, if from a Bayesian posterior, is a Bayesian point estimate. \n",
+"First it will be useful to explain a *Bayesian point esimate*. The systems and machinery present in the modern world are not built to accept posterior distributions as input. It is also rude to hand someone over a distribution when all the asked for was an estimate. In the course of an individual's day, when faced with uncertainty we still act by distilling our uncertainty down to a single action. Similarly, we need to distill our posterior distribution down to a single value (or vector in the multivariate case). If the value is chosen intelligently, we can avoid the flaw of frequentist methodologies that mask the uncertainty and provide a more informative result.The value chosen, if from a Bayesian posterior, is a Bayesian point estimate. \n",
 "\n",
 "Suppose $P(\\theta | X)$ is the posterior distribution of $\\theta$ after observing data $X$, then the following function is understandable as the *expected loss of choosing estimate $\\hat{\\theta}$ to estimate $\\theta$*:\n",
 "\n",
 "$$ l(\\hat{\\theta} ) = E_{\\theta}\\left[ \\; L(\\theta, \\hat{\\theta}) \\; \\right] $$\n",
 "\n",
-"This is also known as the *risk* of estimate $\\hat{\\theta}$. The subscript $\\theta$ under the expectation symbol is used to denote that $\\theta$ is the unknown (random) variable in the expectation, something that at first can difficult to consider.\n",
+"This is also known as the *risk* of estimate $\\hat{\\theta}$. The subscript $\\theta$ under the expectation symbol is used to denote that $\\theta$ is the unknown (random) variable in the expectation, something that at first can be difficult to consider.\n",
 "\n",
 "We spent all of last chapter discussing how to approximate expected values. Given $N$ samples $\\theta_i,\\; i=1,...,N$ from the posterior distribution, and a loss function $L$, we can approximate the expected loss of using estimate $\\hat{\\theta}$ by the Law of Large Numbers:\n",
 "\n",
@@ -522,7 +522,7 @@
 "\n",
 "### Machine Learning via Bayesian Methods\n",
 "\n",
-"Whereas frequentist methods strive to achieve the best precision amount all possible parameters, machine learning cares to acheive the best *prediction* among all possible parameters. Of course, one way to achieve accurate predictions is to aim for accurate predictions, but often your prediction measure and what frequentist methods are optimizing for are very different. \n",
+"Whereas frequentist methods strive to achieve the best precision about all possible parameters, machine learning cares to achieve the best *prediction* among all possible parameters. Of course, one way to achieve accurate predictions is to aim for accurate predictions, but often your prediction measure and what frequentist methods are optimizing for are very different. \n",
 "\n",
 "For example, least-squares linear regression is the most simple active machine learning algorithm. I say active as it engages in some learning, whereas predicting the sample mean is technically *simplier*, but is learning very little if anything. The loss that determines the coefficients of the regressors is a squared-error loss. On the other hand, if your prediction loss function (or score function, which is the negative loss) is not a squared-error, like AUC, ROC, precision, etc., your least-squares line will not be optimal for the prediction loss function. This can lead to prediction results that are suboptimal. \n",
 "\n",
@@ -822,7 +822,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"What is interesting about the above graph is that when the signal is near 0, and many of the possible returns outcomes are possibly both positive and negative, our best (with respect to our loss) prediction is to predict close to 0, hence *take on no position*. Only when we are very confident do we enter into a position. I call this style of model a *sparse prediction*, where we feel uncomfortable with our uncertainity so choose not to act. (Compare with the least-squares prediction which will rarely, if ever, predict zero). \n",
+"What is interesting about the above graph is that when the signal is near 0, and many of the possible returns outcomes are possibly both positive and negative, our best (with respect to our loss) prediction is to predict close to 0, hence *take on no position*. Only when we are very confident do we enter into a position. I call this style of model a *sparse prediction*, where we feel uncomfortable with our uncertainty so choose not to act. (Compare with the least-squares prediction which will rarely, if ever, predict zero). \n",
 "\n",
 "A good sanity check that our model is still reasonable: as the signal becomes more and more extreme, and we feel more and more confident about the positive/negativeness of returns, our position converges with that of the least-squares line. \n",
 "\n",
@@ -1625,4 +1625,4 @@
 "metadata": {}
 }
 ]
-}
+}
