Commit 8e1d7d4

started chapter 5
1 parent d5a378a commit 8e1d7d4

1 file changed

+110
-0
lines changed

@@ -0,0 +1,110 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Introduction to Statistical Learning \n",
8+
"Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani is considered a canonical text in the field of statistical/machine learning and is an absolutely fantastic way to move forward in your analytics career. [The text is free to download](http://www-bcf.usc.edu/~gareth/ISL/) and an [online course by the authors themselves](https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about) is currently available in self-paced mode, meaning you can complete it at any time. Make sure to **[REGISTER FOR THE STANFORD COURSE!](https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about)** The videos have also been [archived here on YouTube](http://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/).\n",
9+
"\n",
10+
"# How will Houston Data Science cover the course?\n",
11+
"The Stanford online course covers the entire book in 9 weeks using the R programming language. The pace at which we cover the book is yet to be determined, as there are many unknowns such as interest from members, availability of a venue and the general skill level of those participating. That said, the target is one meeting per week to discuss the current chapter or the previous chapter's solutions.\n",
12+
"\n",
13+
"\n",
14+
"# Python in place of R\n",
15+
"Although R is a fantastic programming language and is the language in which all the ISLR labs are written, Python, with rare exceptions, has analogous libraries that provide the same statistical functionality as those in R.\n",
16+
"\n",
17+
"# Notes, Exercises and Programming Assignments all in the Jupyter Notebook\n",
18+
"ISLR has both end of chapter problems and programming assignments. All chapter problems and programming assignments will be answered in the notebook.\n",
19+
"\n",
20+
"# Replicating Plots\n",
21+
"The plots in ISLR are created in R. Many of them will be replicated here in the notebook when they appear in the text.\n",
22+
"\n",
23+
"# Book Data\n",
24+
"The data from the book was downloaded using R. All the datasets are found in either the MASS or ISLR packages. They are now in the data directory, listed below."
25+
]
26+
},
27+
{
28+
"cell_type": "code",
29+
"execution_count": 1,
30+
"metadata": {
31+
"collapsed": false
32+
},
33+
"outputs": [
34+
{
35+
"name": "stdout",
36+
"output_type": "stream",
37+
"text": [
38+
"\u001b[31mAdvertising.csv\u001b[m\u001b[m caravan.csv hitters.csv khan_ytrain.csv smarket.csv\r\n",
39+
"Credit.csv carseats.csv khan_xtest.csv nci60_data.csv usarrests.csv\r\n",
40+
"auto.csv college.csv khan_xtrain.csv nci60_labs.csv \u001b[31mwage.csv\u001b[m\u001b[m\r\n",
41+
"boston.csv default.csv khan_ytest.csv portfolio.csv weekly.csv\r\n"
42+
]
43+
}
44+
],
45+
"source": [
46+
"!ls data"
47+
]
48+
},
49+
{
50+
"cell_type": "markdown",
51+
"metadata": {},
52+
"source": [
53+
"# ISLR Videos\n",
54+
"[All Old Videos](https://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/)"
55+
]
56+
},
57+
{
58+
"cell_type": "markdown",
59+
"metadata": {},
60+
"source": [
61+
"# Chapter 5 Resampling Methods\n",
62+
"Covers resampling data through bootstrapping and cross validation. Cross validation gives us an estimate of the test error, and bootstrapping provides estimates of the accuracy of parameter estimates.\n",
63+
"\n",
64+
"### Cross Validation\n",
65+
"Usually a test set is not available, so a simple strategy is to split the available data into a training set and a testing (validation) set. For quantitative responses, MSE is the usual error measure; for categorical responses you can use error rate, area under the curve, F1 score, a weighting of the confusion matrix, etc.\n",
66+
"\n",
67+
"### Leave One Out Cross Validation\n",
68+
"LOOCV has only one observation in the test set and uses the other n-1 observations to build a model. n different models are built, leaving out each observation once, and the error is averaged over these n trials. LOOCV improves on the simple validation set approach above: each model is built on nearly all the data, and there is no randomness in the splits since each observation is left out exactly once. It is computationally expensive, especially with large n and a complex model.\n",
69+
"\n",
70+
"### k-fold cross validation\n",
71+
"Similar to LOOCV, but this time you leave more than one observation out at a time. Here, k is the number of partitions (folds) of your sample, so if you have 1000 observations and k = 10, then each fold will contain 100 observations. Each fold in turn acts as the test set while the model is trained on the other k-1 folds. Compute an MSE for each held-out fold and take the average. LOOCV is the special case of k-fold CV in which k equals the number of observations.\n",
72+
"\n",
73+
"### bias-variance tradeoff between LOOCV and k-folds\n",
74+
"Since LOOCV trains each model on nearly all the data (n-1 observations), its estimate of the test error is less biased than that of k-fold CV, which trains on fewer observations and therefore tends to overestimate the test error. LOOCV has higher variance, however, since its n models are trained on almost identical data and their outputs are very highly correlated with one another. The mean of highly correlated quantities has higher variance than the mean of less correlated ones, so the LOOCV test error estimate (which is what CV is measuring) varies more than the k-fold estimate, which averages fewer, less correlated models. A value of k between 5 and 10 is a good rule of thumb that balances the tradeoff between bias and variance."
75+
]
76+
},
77+
{
78+
"cell_type": "code",
79+
"execution_count": null,
80+
"metadata": {
81+
"collapsed": true
82+
},
83+
"outputs": [],
84+
"source": [
85+
"# can do example where LOOCV has higher variance than k-fold"
86+
]
87+
}
88+
],
89+
"metadata": {
90+
"kernelspec": {
91+
"display_name": "Python [Root]",
92+
"language": "python",
93+
"name": "Python [Root]"
94+
},
95+
"language_info": {
96+
"codemirror_mode": {
97+
"name": "ipython",
98+
"version": 3
99+
},
100+
"file_extension": ".py",
101+
"mimetype": "text/x-python",
102+
"name": "python",
103+
"nbconvert_exporter": "python",
104+
"pygments_lexer": "ipython3",
105+
"version": "3.5.2"
106+
}
107+
},
108+
"nbformat": 4,
109+
"nbformat_minor": 0
110+
}
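
The k-fold procedure described in the Chapter 5 notes can be sketched in plain NumPy. This is only an illustration under stated assumptions, not code from the notebook: the data is synthetic, the model is an ordinary least-squares line, and the helper name `k_fold_mse` is hypothetical. Setting `k` equal to the number of observations gives LOOCV as the special case mentioned in the notes.

```python
import numpy as np


def k_fold_mse(X, y, k, seed=0):
    """Estimate the test MSE of a least-squares linear fit by k-fold CV.

    Hypothetical helper for illustration; returns the averaged MSE and
    the list of index arrays used as the held-out folds.
    """
    n = len(y)
    rng = np.random.default_rng(seed)
    # Shuffle the row indices once, then cut them into k nearly equal folds.
    folds = np.array_split(rng.permutation(n), k)

    fold_mse = []
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False

        # Add an intercept column to the training and held-out design matrices.
        X_train = np.column_stack([np.ones(train_mask.sum()), X[train_mask]])
        X_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])

        # Fit on the other k-1 folds, then score on the held-out fold.
        beta, *_ = np.linalg.lstsq(X_train, y[train_mask], rcond=None)
        residuals = y[test_idx] - X_test @ beta
        fold_mse.append(np.mean(residuals ** 2))

    return float(np.mean(fold_mse)), folds


# Synthetic data (an assumption): y = 2 + 3x + noise with unit variance.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(1000, 1))
y = 2.0 + 3.0 * X[:, 0] + data_rng.normal(scale=1.0, size=1000)

mse_10, folds_10 = k_fold_mse(X, y, k=10)         # 10 folds of 100 observations
mse_loo, folds_loo = k_fold_mse(X, y, k=len(y))   # LOOCV: one observation per fold

print("10-fold CV MSE:", mse_10)
print("LOOCV MSE:     ", mse_loo)
```

With 1000 observations and k = 10, each held-out fold has 100 rows, matching the example in the notes. Both estimates should land near the noise variance of the synthetic data, since the fitted model has the same form as the one that generated it.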
