| 
 | 1 | +{  | 
 | 2 | + "cells": [  | 
 | 3 | +  {  | 
 | 4 | +   "cell_type": "markdown",  | 
 | 5 | +   "metadata": {},  | 
 | 6 | +   "source": [  | 
 | 7 | +    "# Introduction to Statistical Learning \n",  | 
 | 8 | +    "Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani is considered a canonical text in the field of statistical/machine learning and is an absolutely fantastic way to move forward in your analytics career. [The text is free to download](http://www-bcf.usc.edu/~gareth/ISL/) and an [online course by the authors themselves](https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about) is currently available in self-pace mode, meaning you can complete it any time. Make sure to **[REGISTER FOR THE STANDFORD COURSE!](https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about)** The videos have also been [archived here on youtube](http://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/)."  | 
 | 9 | +   ]  | 
 | 10 | +  },  | 
 | 11 | +  {  | 
 | 12 | +   "cell_type": "markdown",  | 
 | 13 | +   "metadata": {},  | 
 | 14 | +   "source": [  | 
 | 15 | +    "# Chapter 10: Unsupervised Learning\n",  | 
 | 16 | +    "Book has been about supervised learning until now. No response variable. Will try and find interesting things in explanatory variables. Chapter will focus on principal components analysis and clustering. No always an exact goal.\n",  | 
 | 17 | +    "\n",  | 
 | 18 | +    "## Principal Components Analysis\n",  | 
 | 19 | +    "Chapter 6 covered principal component regression where the original features were mapped to a smaller feature space that are then used as inputs into linear regression solved normally through least squares.\n",  | 
 | 20 | +    "\n",  | 
 | 21 | +    "PCA can be used to visualize high dimensional data in 2 or 3 dimensions. The first component is a weighted linear combination of all the original features where the sum of the squared weights equals 1. These weights are the loading factors. The loading factors of the first principal component maximize the weighted sum of the features for each observation.\n",  | 
 | 22 | +    "\n",  | 
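 |  | +    "In the book's notation, the score of observation $i$ on the first principal component is\n",  | 
 |  | +    "\n",  | 
 |  | +    "$$z_{i1} = \\phi_{11} x_{i1} + \\phi_{21} x_{i2} + \\dots + \\phi_{p1} x_{ip},$$\n",  | 
 |  | +    "\n",  | 
 |  | +    "and (with the variables centered) the loadings $\\phi_{11}, \\dots, \\phi_{p1}$ are chosen to maximize $\\frac{1}{n} \\sum_{i=1}^{n} z_{i1}^2$ subject to $\\sum_{j=1}^{p} \\phi_{j1}^2 = 1$.\n",  | 
 |  | +    "\n",  | 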
 | 23 | +    "The second principal component is uncorrelated with the first which makes it orthogonal to it.\n",  | 
 | 24 | +    "\n",  | 
 | 25 | +    "The first PC can also be interpreted as the line closest to the data.\n",  | 
 | 26 | +    "\n",  | 
 | 27 | +    "Very important to scale the data first - 0 mean, 1 std. The variances won't make sense otherwise.\n",  | 
 | 28 | +    "\n",  | 
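 |  | +    "As a minimal sketch with scikit-learn (assuming the data are already in a numeric matrix `X`, which is a placeholder, not something defined in these notes):\n",  | 
 |  | +    "\n",  | 
 |  | +    "```python\n",  | 
 |  | +    "from sklearn.preprocessing import StandardScaler\n",  | 
 |  | +    "from sklearn.decomposition import PCA\n",  | 
 |  | +    "\n",  | 
 |  | +    "X_scaled = StandardScaler().fit_transform(X)  # every column gets mean 0, std 1\n",  | 
 |  | +    "pca = PCA()                                   # keep all components by default\n",  | 
 |  | +    "scores = pca.fit_transform(X_scaled)          # principal component scores (the z's)\n",  | 
 |  | +    "loadings = pca.components_                    # each row is a loading vector (the phi's)\n",  | 
 |  | +    "```\n",  | 
 |  | +    "\n",  | 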
 | 29 | +    "### Proportion of variance explained\n",  | 
 | 30 | +    "Each principal component explains some of the variance of the original data. We can find the proportion that each principal component explains by dividing each components variance by the total raw variance. Summing all the variances for each component equals 1.\n",  | 
 | 31 | +    "\n",  | 
 | 32 | +    "Examine a scree plot (for an elbow) to choose the number of principal components to use. Or can use cross validation to choose.\n",  | 
 | 33 | +    "\n",  | 
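 |  | +    "Continuing the sketch above, the proportion of variance explained and a scree plot come straight from the fitted `pca` object:\n",  | 
 |  | +    "\n",  | 
 |  | +    "```python\n",  | 
 |  | +    "import matplotlib.pyplot as plt\n",  | 
 |  | +    "\n",  | 
 |  | +    "pve = pca.explained_variance_ratio_  # proportion of variance explained per component\n",  | 
 |  | +    "print(pve.cumsum())                  # cumulative PVE; reaches 1 when all components are kept\n",  | 
 |  | +    "\n",  | 
 |  | +    "plt.plot(range(1, len(pve) + 1), pve, marker='o')  # scree plot: look for an elbow\n",  | 
 |  | +    "plt.xlabel('Principal component')\n",  | 
 |  | +    "plt.ylabel('Proportion of variance explained')\n",  | 
 |  | +    "plt.show()\n",  | 
 |  | +    "```\n",  | 
 |  | +    "\n",  | 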
 | 34 | +    "## Clustering\n",  | 
 | 35 | +    "Finding groups within data that are similar.\n",  | 
 | 36 | +    "\n",  | 
 | 37 | +    "Can cluster by using the features or by the observation (transposing data matrix)\n",  | 
 | 38 | +    "\n",  | 
 | 39 | +    "### K-Means\n",  | 
 | 40 | +    "Clustering where you define the number of clusters ahead of time. Algorithm works iteratively by first randomly choosing assigning each point to a cluster and computing cluster centers. All points are then reassigned based on euclidean distance to centroids. A new centroid is found by averaging the points in each cluster. Process stops after centroids stop moving or some max number of iterations.\n",  | 
 | 41 | +    "\n",  | 
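 |  | +    "A bare-bones sketch of this loop (illustrative only; it assumes `X` is an `(n, p)` NumPy array, and the `kmeans` helper here is made up for the example):\n",  | 
 |  | +    "\n",  | 
 |  | +    "```python\n",  | 
 |  | +    "import numpy as np\n",  | 
 |  | +    "\n",  | 
 |  | +    "def kmeans(X, k, max_iter=100, seed=0):\n",  | 
 |  | +    "    rng = np.random.default_rng(seed)\n",  | 
 |  | +    "    labels = rng.integers(0, k, size=len(X))    # random initial cluster assignment\n",  | 
 |  | +    "    for _ in range(max_iter):\n",  | 
 |  | +    "        # centroid of each cluster (re-seed from a random point if a cluster is empty)\n",  | 
 |  | +    "        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)\n",  | 
 |  | +    "                              else X[rng.integers(len(X))] for j in range(k)])\n",  | 
 |  | +    "        # reassign every point to the nearest centroid by Euclidean distance\n",  | 
 |  | +    "        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)\n",  | 
 |  | +    "        new_labels = dists.argmin(axis=1)\n",  | 
 |  | +    "        if np.array_equal(new_labels, labels):  # stop once assignments no longer change\n",  | 
 |  | +    "            break\n",  | 
 |  | +    "        labels = new_labels\n",  | 
 |  | +    "    return labels, centroids\n",  | 
 |  | +    "```\n",  | 
 |  | +    "\n",  | 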
 | 42 | +    "Can do initial assignment multiple times and choose clustering assignment with least total variance.\n",  | 
 | 43 | +    "\n",  | 
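 |  | +    "scikit-learn's `KMeans` does these restarts automatically; a short sketch (reusing the hypothetical `X_scaled` matrix from the PCA example above):\n",  | 
 |  | +    "\n",  | 
 |  | +    "```python\n",  | 
 |  | +    "from sklearn.cluster import KMeans\n",  | 
 |  | +    "\n",  | 
 |  | +    "km = KMeans(n_clusters=3, n_init=10, random_state=0)  # 10 random starts, best one is kept\n",  | 
 |  | +    "km.fit(X_scaled)\n",  | 
 |  | +    "print(km.inertia_)      # total within-cluster sum of squares of the chosen solution\n",  | 
 |  | +    "print(km.labels_[:10])  # cluster assignments of the first 10 observations\n",  | 
 |  | +    "```\n",  | 
 |  | +    "\n",  | 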
 | 44 | +    "### Hierarchical Clustering\n",  | 
 | 45 | +    "No need to pre-specify number of clusters. Most common type is bottom-up or Agglomerative. \n",  | 
 | 46 | +    "\n",  | 
 | 47 | +    "#### Interpreting Dendogram\n",  | 
 | 48 | +    "Similarity of points should be determined by the vertical axis not the horizontal axis. The lower on the dendogram that they are connected, the closer they are.\n",  | 
 | 49 | +    "\n",  | 
 | 50 | +    "Hierarchical clustering works by putting each point in its own cluster. Then each pairwise dissimilarity is computed and the least dissimilar clusters are fused. This dissimilarity is the height of the dendogram. Dissimilarity is calculated through a type of linkage and distance metric (usually euclidean).\n",  | 
 | 51 | +    "\n",  | 
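 |  | +    "A short sketch with SciPy (again assuming the hypothetical `X_scaled`; the `method` argument selects the linkage):\n",  | 
 |  | +    "\n",  | 
 |  | +    "```python\n",  | 
 |  | +    "import matplotlib.pyplot as plt\n",  | 
 |  | +    "from scipy.cluster.hierarchy import linkage, dendrogram, fcluster\n",  | 
 |  | +    "\n",  | 
 |  | +    "Z = linkage(X_scaled, method='complete', metric='euclidean')  # bottom-up fusions\n",  | 
 |  | +    "dendrogram(Z)\n",  | 
 |  | +    "plt.ylabel('Dissimilarity at fusion')\n",  | 
 |  | +    "plt.show()\n",  | 
 |  | +    "\n",  | 
 |  | +    "labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters\n",  | 
 |  | +    "```\n",  | 
 |  | +    "\n",  | 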
 | 52 | +    "## Expectation Maximization\n",  | 
 | 53 | +    "K-means and hieracrchical cluserting are 'hard' clustering meaning that each observation belongs to exactly one cluster. There are other clustering algorithms that 'soft' cluster meaning that observations can belong to multiple clusters.\n",  | 
 | 54 | +    "\n",  | 
 | 55 | +    "One way to perform soft clustering is by assuming that each cluster is modeled by a normal (gaussian) distribution. And so the whole data set is a mixture of gaussian also called Gaussian Mixture Model.\n",  | 
 | 56 | +    "\n",  | 
 | 57 | +    "The goal here is to find the parameters for the multivariate gaussian and assign each observation a probability of being in a certain cluster.\n",  | 
 | 58 | +    "\n",  | 
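 |  | +    "With scikit-learn this amounts to fitting a `GaussianMixture`; a sketch (once more assuming the hypothetical `X_scaled`):\n",  | 
 |  | +    "\n",  | 
 |  | +    "```python\n",  | 
 |  | +    "from sklearn.mixture import GaussianMixture\n",  | 
 |  | +    "\n",  | 
 |  | +    "gmm = GaussianMixture(n_components=3, random_state=0).fit(X_scaled)\n",  | 
 |  | +    "probs = gmm.predict_proba(X_scaled)  # soft assignments: one probability per cluster\n",  | 
 |  | +    "hard = probs.argmax(axis=1)          # collapse to a hard assignment if needed\n",  | 
 |  | +    "```\n",  | 
 |  | +    "\n",  | 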
 | 59 | +    "### EM algorithm\n",  | 
 | 60 | +    "In K-means, we start the algorithm by randomly assigning each point a cluster to find the first centroid. Somewhat similarly, the EM algorithm randomly assigns the parameters of gaussian distribution for each cluster. Then using bayes theorem (with initial priors as all equal), we can determine the probability of each point being a part of each cluster. This is expectation step.\n",  | 
 | 61 | +    "\n",  | 
 | 62 | +    "The maximization step is to recalculate the parameters(mean and covariance) of the multivariate gaussian distribution using the now weighted (by probabilities) observations.\n",  | 
 | 63 | +    "\n",  | 
 | 64 | +    "\n",  | 
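 |  | +    "A compact, deliberately simplified sketch of these two steps (it assumes `X` is an `(n, d)` NumPy array; a real implementation would work in log space and check for convergence):\n",  | 
 |  | +    "\n",  | 
 |  | +    "```python\n",  | 
 |  | +    "import numpy as np\n",  | 
 |  | +    "from scipy.stats import multivariate_normal\n",  | 
 |  | +    "\n",  | 
 |  | +    "def em_gmm(X, k, n_iter=50, seed=0):\n",  | 
 |  | +    "    rng = np.random.default_rng(seed)\n",  | 
 |  | +    "    n, d = X.shape\n",  | 
 |  | +    "    means = X[rng.choice(n, size=k, replace=False)]               # random points as initial means\n",  | 
 |  | +    "    covs = np.array([np.cov(X, rowvar=False) for _ in range(k)])  # start from the overall covariance\n",  | 
 |  | +    "    priors = np.full(k, 1.0 / k)                                  # equal priors to begin with\n",  | 
 |  | +    "    for _ in range(n_iter):\n",  | 
 |  | +    "        # E-step: responsibility of each cluster for each point, via Bayes' theorem\n",  | 
 |  | +    "        dens = np.column_stack([priors[j] * multivariate_normal(means[j], covs[j]).pdf(X)\n",  | 
 |  | +    "                                for j in range(k)])\n",  | 
 |  | +    "        resp = dens / dens.sum(axis=1, keepdims=True)\n",  | 
 |  | +    "        # M-step: re-estimate priors, means and covariances with responsibility weights\n",  | 
 |  | +    "        nk = resp.sum(axis=0)\n",  | 
 |  | +    "        priors = nk / n\n",  | 
 |  | +    "        means = (resp.T @ X) / nk[:, None]\n",  | 
 |  | +    "        covs = np.array([(resp[:, j, None] * (X - means[j])).T @ (X - means[j]) / nk[j]\n",  | 
 |  | +    "                         + 1e-6 * np.eye(d) for j in range(k)])  # small ridge keeps covariances valid\n",  | 
 |  | +    "    return priors, means, covs, resp\n",  | 
 |  | +    "```\n",  | 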
 | 65 | +    "\n",  | 
 | 66 | +    "[Excellent video explanation of EM](https://www.youtube.com/watch?v=REypj2sy_5U)"  | 
 | 67 | +   ]  | 
 | 68 | +  },  | 
 | 69 | +  {  | 
 | 70 | +   "cell_type": "code",  | 
 | 71 | +   "execution_count": null,  | 
 | 72 | +   "metadata": {  | 
 | 73 | +    "collapsed": true  | 
 | 74 | +   },  | 
 | 75 | +   "outputs": [],  | 
 | 76 | +   "source": []  | 
 | 77 | +  }  | 
 | 78 | + ],  | 
 | 79 | + "metadata": {  | 
 | 80 | +  "anaconda-cloud": {},  | 
 | 81 | +  "kernelspec": {  | 
 | 82 | +   "display_name": "Python 3",  | 
 | 83 | +   "language": "python",  | 
 | 84 | +   "name": "python3"  | 
 | 85 | +  },  | 
 | 86 | +  "language_info": {  | 
 | 87 | +   "codemirror_mode": {  | 
 | 88 | +    "name": "ipython",  | 
 | 89 | +    "version": 3  | 
 | 90 | +   },  | 
 | 91 | +   "file_extension": ".py",  | 
 | 92 | +   "mimetype": "text/x-python",  | 
 | 93 | +   "name": "python",  | 
 | 94 | +   "nbconvert_exporter": "python",  | 
 | 95 | +   "pygments_lexer": "ipython3",  | 
 | 96 | +   "version": "3.6.1"  | 
 | 97 | +  }  | 
 | 98 | + },  | 
 | 99 | + "nbformat": 4,  | 
 | 100 | + "nbformat_minor": 1  | 
 | 101 | +}  | 