more edits

elmerehbi · elmerehbi · commit 8923e453f937 · 2015-05-20T12:16:11.000+03:00
diff --git a/6_STATINFERENCE/Statistical Inference Course Notes.Rmd b/6_STATINFERENCE/Statistical Inference Course Notes.Rmd
@@ -304,7 +304,7 @@ ggplot(dat, aes(x = x, y = y, color = factor)) + geom_line(size = 2)
 ```
 
 
-* **variance** = measure of spread, the square of expected distance from the mean (expressed in $X$'s units$^2$)
+* **variance** = measure of spread or dispersion, the expected squared distance of the variable from its mean (expressed in $X$'s units$^2$)
 	- as we can see from above, higher variances $\rightarrow$ more spread, lower $\rightarrow$ smaller spread
 	* $Var(X) = E[(X-\mu)^2] = E[X^2] - E[X]^2$
 	* **standard deviation** $= \sqrt{Var(X)}$ $\rightarrow$ has same units as X
@@ -352,7 +352,7 @@ grid.raster(readPNG("figures/8.png"))
 ```
 
 * **distribution for mean of random samples**
-	* expected value of the **mean** of distribution of means = expected value of the sample = population mean
+	* expected value of the **mean** of distribution of means = expected value of the sample mean = population mean
 		* $E[\bar X]=\mu$
 	* expected value of the variance of distribution of means
 		* $Var(\bar X) = \sigma^2/n$
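
Note on the hunk above: both identities for the sample mean follow in a few lines. This is a sketch that assumes the draws $X_1, \dots, X_n$ are independent and identically distributed with mean $\mu$ and variance $\sigma^2$. The notes say "random samples" but do not state that assumption explicitly.

$$
\begin{aligned}
E[\bar X] &= E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\, n\mu = \mu \\
Var(\bar X) &= Var\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} Var(X_i) = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}
\end{aligned}
$$

Independence is only needed for the second line, where the variance of the sum splits into a sum of variances. The first line holds by linearity of expectation alone.
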
diff --git a/8_PREDMACHLEARN/Practical Machine Learning Course Notes HTML.Rmd b/8_PREDMACHLEARN/Practical Machine Learning Course Notes HTML.Rmd
@@ -859,10 +859,10 @@ matlines(testFaith$waiting,pred1,type="l",,col=c(1,2,2),lty = c(1,1,1), lwd=3)
 		+ multiple predictors (dummy/indicator variables) are created for factor variables
 	- `plot(lm$finalModel)` = construct 4 diagnostic plots for evaluating the model
 		+ ***Note**: more information on these plots can be found at `?plot.lm` *
-		+ ***Residual vs Fitted***
+		+ ***Residuals vs Fitted***
 		+ ***Normal Q-Q***
 		+ ***Scale-Location***
-		+ ***Residual vs Leverage***
+		+ ***Residuals vs Leverage***
 
 ```{r fig.align = 'center'}
 # create train and test sets
@@ -878,15 +878,23 @@ par(mfrow = c(2, 2))
 plot(finMod,pch=19,cex=0.5,col="#00000010")
 ```
 
-* plotting residuals by index can be helpful in showing missing variables
+* plotting residuals against fitted values, colored by a variable not used in the model, helps show whether the residuals follow a pattern in that variable
+
+```{r fig.width = 4, fig.height = 3, fig.align = 'center'}
+# plot residuals (y-axis) against fitted values (x-axis), colored by race
+qplot(finMod$fitted.values, finMod$residuals, color=race, data=training)
+```
+
+* plotting residuals by index (i.e., row numbers) can be helpful in showing missing variables
 	- `plot(finMod$residuals)` = plot the residuals against index (row number)
-	- if there's a trend/pattern in the residuals, it is highly likely that another variable (such as age/time) should be included
+	- if there's a trend/pattern in the residuals, it is highly likely that another variable (such as age/time) should be included.
 		+ residuals should not have relationship to index
 
 ```{r fig.width = 4, fig.height = 3, fig.align = 'center'}
 # plot residual by index
 plot(finMod$residuals,pch=19,cex=0.5)
 ```
+
 * here the residuals increase linearly with the index, and the highest residuals are concentrated in the higher indexes, so there must be a missing variable
 
 
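
Note on the hunk above: the claim that a trend in residuals against row index points to a missing variable can be checked on simulated data. The sketch below is not from the course notes. The data, the coefficient `0.05`, and the names `idx`, `fit_missing`, and `fit_full` are all hypothetical, chosen only to make the effect visible.

```r
# Hypothetical simulated data: the outcome depends on x and on the row index,
# which stands in for an omitted variable such as time.
set.seed(1)
n   <- 200
idx <- seq_len(n)
x   <- rnorm(n)
y   <- 2 * x + 0.05 * idx + rnorm(n)

# Model that omits the index-related variable, and one that includes it
fit_missing <- lm(y ~ x)
fit_full    <- lm(y ~ x + idx)

# Correlation between residuals and row index for each model
cor_missing <- cor(idx, resid(fit_missing))
cor_full    <- cor(idx, resid(fit_full))

# Residuals by index for the model with the omitted variable
plot(resid(fit_missing), pch = 19, cex = 0.5)
```

Under these assumptions the residuals of `fit_missing` trend upward with the index, while those of `fit_full` are uncorrelated with it, because least squares makes the residuals orthogonal to every included regressor.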
diff --git a/8_PREDMACHLEARN/Practical Machine Learning Course Notes.Rmd b/8_PREDMACHLEARN/Practical Machine Learning Course Notes.Rmd
@@ -875,10 +875,10 @@ matlines(testFaith$waiting,pred1,type="l",,col=c(1,2,2),lty = c(1,1,1), lwd=3)
 		+ multiple predictors (dummy/indicator variables) are created for factor variables
 	- `plot(lm$finalModel)` = construct 4 diagnostic plots for evaluating the model
 		+ ***Note**: more information on these plots can be found at `?plot.lm` *
-		+ ***Residual vs Fitted***
+		+ ***Residuals vs Fitted***
 		+ ***Normal Q-Q***
 		+ ***Scale-Location***
-		+ ***Residual vs Leverage***
+		+ ***Residuals vs Leverage***
 
 ```{r fig.align = 'center'}
 # create train and test sets
@@ -894,9 +894,16 @@ par(mfrow = c(2, 2))
 plot(finMod,pch=19,cex=0.5,col="#00000010")
 ```
 
-* plotting residuals by index can be helpful in showing missing variables
+* plotting residuals against fitted values, colored by a variable not used in the model, helps show whether the residuals follow a pattern in that variable
+
+```{r fig.width = 4, fig.height = 3, fig.align = 'center'}
+# plot residuals (y-axis) against fitted values (x-axis), colored by race
+qplot(finMod$fitted.values, finMod$residuals, color=race, data=training)
+```
+
+* plotting residuals by index (i.e., row numbers) can be helpful in showing missing variables
 	- `plot(finMod$residuals)` = plot the residuals against index (row number)
-	- if there's a trend/pattern in the residuals, it is highly likely that another variable (such as age/time) should be included
+	- if there's a trend/pattern in the residuals, it is highly likely that another variable (such as age/time) should be included.
 		+ residuals should not have relationship to index
 
 ```{r fig.width = 4, fig.height = 3, fig.align = 'center'}