started chapter 5

tdpetrou · tdpetrou · commit 8e1d7d4422db · 2016-08-30T07:58:31.000-05:00
diff --git a/Introduction to Statistical Learning/Chapter 5.ipynb b/Introduction to Statistical Learning/Chapter 5.ipynb
@@ -0,0 +1,110 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Introduction to Statistical Learning \n",
+    "Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani is considered a canonical text in the field of statistical/machine learning and is an absolutely fantastic way to move forward in your analytics career. [The text is free to download](http://www-bcf.usc.edu/~gareth/ISL/) and an [online course by the authors themselves](https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about) is currently available in self-paced mode, meaning you can complete it any time. Make sure to **[REGISTER FOR THE STANFORD COURSE!](https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about)** The videos have also been [archived here on YouTube](http://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/).\n",
+    "\n",
+    "# How will Houston Data Science cover the course?\n",
+    "The Stanford online course covers the entire book in 9 weeks using the R programming language. The pace at which we cover the book is yet to be determined, as there are many unknowns such as interest from members, availability of a venue and the general skill level of those participating. That said, the target is one meeting per week to discuss the current chapter or the previous chapter's solutions.\n",
+    "\n",
+    "\n",
+    "# Python in place of R\n",
+    "Although R is a fantastic programming language and is the language that all the ISLR labs are written in, Python, with rare exceptions, has analogous libraries that provide the same statistical functionality as those in R.\n",
+    "\n",
+    "# Notes, Exercises and Programming Assignments all in the Jupyter Notebook\n",
+    "ISLR has both end of chapter problems and programming assignments. All chapter problems and programming assignments will be answered in the notebook.\n",
+    "\n",
+    "# Replicating Plots\n",
+    "The plots in ISLR are created in R. Many of them will be replicated here in the notebook when they appear in the text.\n",
+    "\n",
+    "# Book Data\n",
+    "The data from the book was downloaded using R. All the datasets are found in either the MASS or ISLR packages. They are now in the data directory. See below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\u001b[31mAdvertising.csv\u001b[m\u001b[m caravan.csv     hitters.csv     khan_ytrain.csv smarket.csv\r\n",
+      "Credit.csv      carseats.csv    khan_xtest.csv  nci60_data.csv  usarrests.csv\r\n",
+      "auto.csv        college.csv     khan_xtrain.csv nci60_labs.csv  \u001b[31mwage.csv\u001b[m\u001b[m\r\n",
+      "boston.csv      default.csv     khan_ytest.csv  portfolio.csv   weekly.csv\r\n"
+     ]
+    }
+   ],
+   "source": [
+    "!ls data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# ISLR Videos\n",
+    "[All Old Videos](https://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Chapter 5 Resampling Methods\n",
+    "Covers resampling data through bootstrapping and cross validation. Cross validation gives us an estimate of the test error, and bootstrapping provides estimates of the accuracy of parameter estimates.\n",
+    "\n",
+    "### Cross Validation\n",
+    "Usually a test set is not available, so a simple strategy is to split the available data into a training set and a held-out validation set that stands in for test data. For quantitative responses, MSE is the usual error measure; for categorical responses, options include error rate, area under the curve, F1 score, or a weighting of the confusion matrix.\n",
+    "\n",
+    "### Leave One Out Cross Validation\n",
+    "LOOCV has only one observation in the test set and uses the other n-1 observations to build a model. n different models are built, leaving out each observation once, and the error is averaged over these n trials. LOOCV improves on the single validation split above: each model is built on nearly all the data, and there is no randomness in the splits since each observation is left out exactly once. It is computationally expensive, especially with large n and a complex model.\n",
+    "\n",
+    "### k-fold cross validation\n",
+    "Similar to LOOCV, but this time more than one observation is left out at a time. Here, k is the number of partitions (folds) of your sample, so if you have 1000 observations and k = 10, then each fold will have 100 observations. Each fold in turn acts as the test set while the model is fit on the other k-1 folds. Get an MSE for each fold and take the average over the k folds. LOOCV is the special case of k-fold CV where k equals the number of observations.\n",
+    "\n",
+    "### bias-variance tradeoff between LOOCV and k-folds\n",
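+    "Since LOOCV trains each model on n-1 observations, nearly all the data, its estimate of the test error is generally less biased than the k-fold estimate, which trains on fewer observations and so tends to overestimate the test error. LOOCV will have higher variance because its n models are fit on nearly identical training sets, so their errors are highly correlated with one another. The average of highly correlated quantities varies more than the average of less correlated ones, so the LOOCV estimate of the test error (which is what CV is measuring) varies more than the k-fold estimate, which averages fewer models that are less correlated with one another. A value of k between 5 and 10 is a good rule of thumb that balances the tradeoff between bias and variance."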
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "# can do example where LOOCV has higher variance than k-fold"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python [Root]",
+   "language": "python",
+   "name": "Python [Root]"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.5.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
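
The k-fold procedure described in the Chapter 5 markdown cell above (1000 observations, k = 10, folds of 100, with LOOCV as the case k = n) can be sketched in plain Python. This is only an illustrative sketch, not code from the notebook: the helper names `fit_line` and `k_fold_mse`, the synthetic data, and the choice of a one-predictor least-squares model are all assumptions made for the example. In practice a library such as scikit-learn would normally handle the splitting.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def k_fold_mse(xs, ys, k, seed=0):
    """Return (average test MSE over k folds, list of fold sizes)."""
    n = len(xs)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [idx[i::k] for i in range(k)]
    fold_mse = []
    for test_idx in folds:
        held_out = set(test_idx)
        # Fit on every observation that is not in the current fold.
        train_x = [xs[i] for i in idx if i not in held_out]
        train_y = [ys[i] for i in idx if i not in held_out]
        intercept, slope = fit_line(train_x, train_y)
        # Score the fitted line on the held-out fold only.
        sq_err = [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in test_idx]
        fold_mse.append(sum(sq_err) / len(sq_err))
    return sum(fold_mse) / k, [len(f) for f in folds]


# Synthetic data, made up for illustration: y = 2 + 3x + unit-variance noise.
rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(1000)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

cv10_mse, cv10_sizes = k_fold_mse(xs, ys, k=10)
loocv_mse, loocv_sizes = k_fold_mse(xs, ys, k=len(xs))  # LOOCV: k equals n

print("10-fold CV MSE:", round(cv10_mse, 3), "fold sizes:", cv10_sizes)
print("LOOCV MSE:     ", round(loocv_mse, 3), "models fit:", len(loocv_sizes))
```

Both estimates should land near the noise variance of 1 used to generate the data. The LOOCV call fits 1000 models where 10-fold fits 10, which is the computational cost the notes mention.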