
Commit 13b5c21

Merge pull request sux13#12 from elmerehbi/master
Some more edits & additions

2 parents: 4b171d2 + f383850

2 files changed: 25 additions, 16 deletions

6_STATINFERENCE/Statistical Inference Course Notes.Rmd

Lines changed: 13 additions & 11 deletions
````diff
@@ -304,7 +304,7 @@ ggplot(dat, aes(x = x, y = y, color = factor)) + geom_line(size = 2)
 ```
 
 
-* **variance** = measure of spread, the square of expected distance from the mean (expressed in $X$'s units$^2$)
+* **variance** = measure of spread or dispersion, the expected squared distance of the variable from its mean (expressed in $X$'s units$^2$)
     - as we can see from above, higher variances $\rightarrow$ more spread, lower $\rightarrow$ smaller spread
 * $Var(X) = E[(X-\mu)^2] = E[X^2] - E[X]^2$
 * **standard deviation** $= \sqrt{Var(X)}$ $\rightarrow$ has same units as X
````
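To sanity-check the variance identity above, a minimal R sketch with simulated data (parameters are arbitrary, not from the notes):

```r
# draw from N(mean = 5, sd = 2); the true variance is 4
x <- rnorm(1e5, mean = 5, sd = 2)
mean((x - mean(x))^2)   # E[(X - mu)^2], approximately 4
mean(x^2) - mean(x)^2   # E[X^2] - E[X]^2, equal by algebra
sd(x)                   # standard deviation, approximately 2 (same units as X)
```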
````diff
@@ -352,7 +352,7 @@ grid.raster(readPNG("figures/8.png"))
 ```
 
 * **distribution for mean of random samples**
-    * expected value of the **mean** of distribution of means = expected value of the sample = population mean
+    * expected value of the **mean** of distribution of means = expected value of the sample mean = population mean
     * $E[\bar X]=\mu$
     * expected value of the variance of distribution of means
    * $Var(\bar X) = \sigma^2/n$
````
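Both facts are easy to verify by simulation; a short sketch assuming $\mu = 0$, $\sigma = 1$, $n = 25$:

```r
# 10000 sample means, each computed from n = 25 draws of N(0, 1)
n <- 25
xbar <- replicate(10000, mean(rnorm(n)))
mean(xbar)   # approximately mu = 0
var(xbar)    # approximately sigma^2 / n = 1/25 = 0.04
```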
````diff
@@ -647,12 +647,12 @@ grid.arrange(g, p, ncol = 2)
 
 ### Example - CLT with Bernoulli Trials (Coin Flips)
 - for this example, we will simulate $n$ flips of a possibly unfair coin
-- $X_i$ be the 0 or 1 result of the $i^{th}$ flip of a possibly unfair coin
+- let $X_i$ be the 0 or 1 result of the $i^{th}$ flip of a possibly unfair coin
     + sample proportion, $\hat p$, is the average of the coin flips
     + $E[X_i] = p$ and $Var(X_i) = p(1-p)$
     + standard error of the mean is $SE = \sqrt{p(1-p)/n}$
     + in principle, normalizing the random variable $X_i$, we should get an approximately standard normal distribution $$\frac{\hat p - p}{\sqrt{p(1-p)/n}} \sim N(0,~1)$$
-- therefore, we will flip a coin $n$ times, take the sample proportion of heads (successes with probability $p$), subtract off 0.5 (ideal sample proportion) and multiply the result by divide by $\frac{1}{2 \sqrt{n}}$ and compare it to the standard normal
+- therefore, we will flip a coin $n$ times, take the sample proportion of heads (successes with probability $p$), subtract off 0.5 (ideal sample proportion), divide the result by $\frac{1}{2 \sqrt{n}}$ (equivalently, multiply by $2\sqrt{n}$), and compare it to the standard normal
 
 ```{r, echo = FALSE, fig.width=6, fig.height = 3, fig.align='center'}
 # specify number of simulations
````
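The simulation chunk is truncated in this hunk; a self-contained sketch of the normalization just described (assuming a fair coin, so $p = 0.5$ and $SE = \frac{1}{2\sqrt{n}}$) could look like:

```r
# normalized sample proportions from n coin flips, repeated 10000 times
n <- 100; p <- 0.5
z <- replicate(10000, (mean(rbinom(n, 1, p)) - p) / sqrt(p * (1 - p) / n))
c(mean(z), sd(z))   # approximately 0 and 1, i.e. close to standard normal
```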
````diff
@@ -711,7 +711,7 @@ g + facet_grid(. ~ size)
 * **95% confidence interval for the population mean $\mu$** is defined as $$\bar X \pm 2\sigma/\sqrt{n}$$ for the sample mean $\bar X \sim N(\mu, \sigma^2/n)$
     * you can choose to use 1.96 to be more accurate for the confidence interval
     * $P(\bar{X} > \mu + 2\sigma/\sqrt{n}~or~\bar{X} < \mu - 2\sigma/\sqrt{n}) = 5\%$
-    * **interpretation**: if we were to repeated samples of size $n$ from the population and construct this confidence interval for each case, approximately 95% of the intervals will contain $\mu$
+    * **interpretation**: if we were to repeatedly draw samples of size $n$ from the population and construct this confidence interval for each case, approximately 95% of the intervals will contain $\mu$
 * confidence intervals get **narrower** with less variability or
 larger sample sizes
 * ***Note**: Poisson and binomial distributions have exact intervals that don't require CLT *
````
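The repeated-sampling interpretation lends itself to a quick coverage check; a sketch assuming $\mu = 0$, $\sigma = 1$, $n = 50$:

```r
# fraction of simulated 95% intervals that contain the true mean mu = 0
n <- 50
covered <- replicate(10000, {
  x <- rnorm(n)   # sample from a population with mu = 0, sigma = 1
  ci <- mean(x) + c(-1, 1) * qnorm(0.975) * sd(x) / sqrt(n)
  ci[1] < 0 & ci[2] > 0
})
mean(covered)   # approximately 0.95
```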
````diff
@@ -729,9 +729,10 @@ mean(x) + c(-1, 1) * qnorm(0.975) * sd(x)/sqrt(length(x))
 ### Confidence Interval - Bernoulli Distribution/Wald Interval
 * for Bernoulli distributions, $X_i$ is 0 or 1 with success probability $p$ and the variance is $\sigma^2 = p(1 - p)$
 * the confidence interval takes the form of $$\hat{p} \pm z_{1-\alpha/2}\sqrt{\frac{p(1-p)}{n}}$$
-* since the population proportion $p$ is unknown, we can use $\hat{p} = X/n$ as estimate
+* since the population proportion $p$ is unknown, we can use the sample proportion of successes, $\hat{p} = X/n$, as an estimate
 * $p(1-p)$ is largest when $p = 1/2$, so the 95% confidence interval can be calculated by $$\begin{aligned}
-\hat{p} \pm Z_{0.95} \sqrt{\frac{0.5(1-0.5)}{n}} & = \hat{p} \pm 1.96 \sqrt{\frac{1}{4n}}\\
+\hat{p} \pm Z_{0.95} \sqrt{\frac{0.5(1-0.5)}{n}} & = \hat{p} \pm \mathrm{qnorm}(.975) \sqrt{\frac{1}{4n}}\\
+& = \hat{p} \pm 1.96 \sqrt{\frac{1}{4n}}\\
 & = \hat{p} \pm \frac{1.96}{2} \sqrt{\frac{1}{n}}\\
 & \approx \hat{p} \pm \frac{1}{\sqrt{n}}\\
 \end{aligned}$$
````
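A small sketch comparing the Wald interval with the $\hat{p} \pm \frac{1}{\sqrt{n}}$ shortcut derived above, using made-up counts (56 successes in 100 trials):

```r
# Wald interval vs the quick 1/sqrt(n) approximation
n <- 100; phat <- 56 / n
phat + c(-1, 1) * qnorm(0.975) * sqrt(phat * (1 - phat) / n)   # Wald interval
phat + c(-1, 1) / sqrt(n)                                      # approximation
```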
````diff
@@ -948,6 +949,7 @@ t.test(g2, g1, paired = TRUE)
 * $S_p\left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}$ = standard error
 * $S_p^2 = \{(n_x - 1) S_x^2 + (n_y - 1) S_y^2\}/(n_x + n_y - 2)$ = pooled variance estimator
     * this is effectively a weighted average between the two variances, such that different sample sizes are taken into account
+    * for equal sample sizes, $n_x = n_y$, $S_p^2 = \frac{S_x^2 + S_y^2}{2}$ (the average of the two group variances)
 * ***Note:** this interval assumes **constant variance** across two groups; if variance is different, use the next interval *
 
 ### Independent Group t Intervals - Different Variance
````
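Referring back to the pooled variance estimator $S_p^2$ in the hunk above, a sketch with two simulated groups of unequal size (all numbers arbitrary):

```r
# pooled variance and 95% t interval for two independent groups
g1 <- rnorm(10, mean = 3); g2 <- rnorm(14, mean = 5)
nx <- length(g1); ny <- length(g2)
sp2 <- ((nx - 1) * var(g1) + (ny - 1) * var(g2)) / (nx + ny - 2)
mean(g2) - mean(g1) +
  c(-1, 1) * qt(0.975, nx + ny - 2) * sqrt(sp2 * (1 / nx + 1 / ny))
```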
````diff
@@ -1001,7 +1003,7 @@ $H_a$ | $H_0$ | Type II error |
 
 * **$\alpha$** = Type I error rate
     * probability of ***rejecting*** the null hypothesis when the hypothesis is ***correct***
-    * $\alpha$ = 0.5 $\rightarrow$ standard for hypothesis testing
+    * $\alpha$ = 0.05 $\rightarrow$ standard for hypothesis testing
     * ***Note**: as Type I error rate increases, Type II error rate decreases and vice versa *
 
 * for large samples (large n), use the **Z Test** for $H_0:\mu = \mu_0$
````
````diff
@@ -1014,7 +1016,7 @@ $H_a$ | $H_0$ | Type II error |
     * $H_1: TS \leq Z_{\alpha}$ OR $-Z_{1 - \alpha}$
     * $H_2: |TS| \geq Z_{1 - \alpha / 2}$
     * $H_3: TS \geq Z_{1 - \alpha}$
-    * ***Note**: In case of $\alpha$ = 0.5 (most common), $Z_{1-\alpha}$ = 1.645 (95 percentile) *
+    * ***Note**: In case of $\alpha$ = 0.05 (most common), $Z_{1-\alpha}$ = 1.645 (95th percentile) *
     * $\alpha$ = low, so that when $H_0$ is rejected, original model $\rightarrow$ wrong or made an error (low probability)
 
 * For small samples (small n), use the **T Test** for $H_0:\mu = \mu_0$
````
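For the two-sided rejection region $|TS| \geq Z_{1-\alpha/2}$ above, a worked sketch with assumed numbers ($\mu_0 = 30$, $\bar{x} = 32$, $\sigma = 10$, $n = 100$):

```r
# two-sided Z test: reject H0 if |TS| exceeds the 97.5th normal percentile
ts <- (32 - 30) / (10 / sqrt(100))   # test statistic = 2
abs(ts) >= qnorm(0.975)              # TRUE, so reject H0 at alpha = 0.05
```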
````diff
@@ -1027,7 +1029,7 @@ $H_a$ | $H_0$ | Type II error |
     * $H_1: TS \leq T_{\alpha}$ OR $-T_{1 - \alpha}$
     * $H_2: |TS| \geq T_{1 - \alpha / 2}$
     * $H_3: TS \geq T_{1 - \alpha}$
-    * ***Note**: In case of $\alpha$ = 0.5 (most common), $T_{1-\alpha}$ = `qt(.95, df = n-1)` *
+    * ***Note**: In case of $\alpha$ = 0.05 (most common), $T_{1-\alpha}$ = `qt(.95, df = n-1)` *
     * R commands for T test:
         * `t.test(vector1 - vector2)`
         * `t.test(vector1, vector2, paired = TRUE)`
````
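A runnable usage sketch for the commands listed above, with simulated paired vectors (the names and the shift are invented):

```r
# paired t test; equivalent to a one-sample test on the differences
vector1 <- rnorm(20)
vector2 <- vector1 + rnorm(20, mean = 1, sd = 0.5)
t.test(vector1, vector2, paired = TRUE)
t.test(vector1 - vector2)   # same test statistic and p-value
```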
````diff
@@ -1042,7 +1044,7 @@ $H_a$ | $H_0$ | Type II error |
 
 * **two-sided tests** $\rightarrow$ $H_a: \mu \neq \mu_0$
     * reject $H_0$ only if test statistic is too large or too small
-    * for $\alpha$ = 0.5, split equally to 2.5% for upper and 2.5% for lower tails
+    * for $\alpha$ = 0.05, split equally into 2.5% for the upper and 2.5% for the lower tail
     * equivalent to $|TS| \geq T_{1 - \alpha / 2}$
     * example: for T test, `qt(.975, df)` and `qt(.025, df)`
 * ***Note**: failing to reject the one-sided test = failing to reject the two-sided test*
````

8_PREDMACHLEARN/Practical Machine Learning Course Notes.Rmd

Lines changed: 12 additions & 5 deletions
````diff
@@ -528,7 +528,7 @@ p2 <- qplot(cutWage,age, data=training,fill=cutWage,
 grid.arrange(p1,p2,ncol=2)
 ```
 
-* `table(cutVariable, data$var2)` = tabulates the cut factor variable vs another variable in the dataset
+* `table(cutVariable, data$var2)` = tabulates the cut factor variable vs another variable in the dataset (i.e., builds a contingency table using cross-classifying factors)
 * `prop.table(table, margin=1)` = converts a table to a proportion table
     - `margin=1` = calculate the proportions based on the rows
     - `margin=2` = calculate the proportions based on the columns
````
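A small sketch of this tabulation pattern with a hypothetical data frame (the variables here are invented for illustration):

```r
# contingency table of a cut factor vs another factor, then row proportions
df <- data.frame(wage = c(10, 20, 30, 40, 50, 60),
                 jobclass = c("A", "B", "A", "B", "A", "B"))
cutWage <- cut(df$wage, breaks = 3)   # factor with 3 wage ranges
t1 <- table(cutWage, df$jobclass)     # cross-classified counts
prop.table(t1, margin = 1)            # proportions within each row
```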
````diff
@@ -875,10 +875,10 @@ matlines(testFaith$waiting,pred1,type="l",,col=c(1,2,2),lty = c(1,1,1), lwd=3)
     + multiple predictors (dummy/indicator variables) are created for factor variables
 - `plot(lm$finalModel)` = construct 4 diagnostic plots for evaluating the model
     + ***Note**: more information on these plots can be found at `?plot.lm` *
-    + ***Residual vs Fitted***
+    + ***Residuals vs Fitted***
     + ***Normal Q-Q***
     + ***Scale-Location***
-    + ***Residual vs Leverage***
+    + ***Residuals vs Leverage***
 
 ```{r fig.align = 'center'}
 # create train and test sets
````
````diff
@@ -894,9 +894,16 @@ par(mfrow = c(2, 2))
 plot(finMod,pch=19,cex=0.5,col="#00000010")
 ```
 
-* plotting residuals by index can be helpful in showing missing variables
+* plotting residuals by fitted values and coloring with a variable not used in the model helps spot a trend in that variable.
+
+```{r fig.width = 4, fig.height = 3, fig.align = 'center'}
+# plot residuals against fitted values
+qplot(finMod$fitted, finMod$residuals, color=race, data=training)
+```
+
+* plotting residuals by index (i.e., row numbers) can be helpful in showing missing variables
     - `plot(finMod$residuals)` = plot the residuals against index (row number)
-    - if there's a trend/pattern in the residuals, it is highly likely that another variable (such as age/time) should be included
+    - if there's a trend/pattern in the residuals, it is highly likely that another variable (such as age/time) should be included.
         + residuals should not have a relationship to the index
 
 ```{r fig.width = 4, fig.height = 3, fig.align = 'center'}
````
