|
5 | 5 | "metadata": {}, |
6 | 6 | "source": [ |
7 | 7 | "# Introduction to Statistical Learning \n", |
8 | | - "Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani is considered a canonical text in the field of statistical/machine learning and is an absolutely fantastic way to move forward in your analytics career. [The text is free to download](http://www-bcf.usc.edu/~gareth/ISL/) and an [online course by the authors themselves](https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about) is currently available in self-pace mode, meaning you can complete it any time. Make sure to **[REGISTER FOR THE STANDFORD COURSE!](https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about)** The videos have also been [archived here on youtube](http://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/).\n", |
9 | | - "\n", |
10 | | - "# How will Houston Data Science cover the course?\n", |
11 | | - "The Stanford online course covers the entire book in 9 weeks and with the R programming language. The pace that we cover the book is yet to be determined as there are many unknown variables such as interest from members, availability of a venue and general level of skills of those participating. That said, a meeting once per week to discuss the current chapter or previous chapter solutions is the target.\n", |
12 | | - "\n", |
13 | | - "\n", |
14 | | - "# Python in place of R\n", |
15 | | - "Although R is a fantastic programming language and is the language that all the ISLR labs are written in, the Python programming language, except for rare exceptions, contains analgous libraries that contain the same statistical functionality as those in R.\n", |
16 | | - "\n", |
17 | | - "# Notes, Exercises and Programming Assignments all in the Jupyter Notebok\n", |
18 | | - "ISLR has both end of chapter problems and programming assignments. All chapter problems and programming assignments will be answered in the notebook.\n", |
19 | | - "\n", |
20 | | - "# Replicating Plots\n", |
21 | | - "The plots in ISLR are created in R. Many of them will be replicated here in the notebook when they appear in the text\n", |
22 | | - "\n", |
23 | | - "# Book Data\n", |
24 | | - "The data from the books was downloaded using R. All the datasets are found in either the MASS or ISLR packages. They are now in the data directory. See below" |
25 | | - ] |
26 | | - }, |
27 | | - { |
28 | | - "cell_type": "code", |
29 | | - "execution_count": 1, |
30 | | - "metadata": { |
31 | | - "collapsed": false |
32 | | - }, |
33 | | - "outputs": [ |
34 | | - { |
35 | | - "name": "stdout", |
36 | | - "output_type": "stream", |
37 | | - "text": [ |
38 | | - "\u001b[31mAdvertising.csv\u001b[m\u001b[m* carseats.csv khan_xtrain.csv portfolio.csv\r\n", |
39 | | - "Credit.csv college.csv khan_ytest.csv smarket.csv\r\n", |
40 | | - "auto.csv default.csv khan_ytrain.csv usarrests.csv\r\n", |
41 | | - "boston.csv hitters.csv nci60_data.csv \u001b[31mwage.csv\u001b[m\u001b[m*\r\n", |
42 | | - "caravan.csv khan_xtest.csv nci60_labs.csv weekly.csv\r\n" |
43 | | - ] |
44 | | - } |
45 | | - ], |
46 | | - "source": [ |
47 | | - "ls data" |
48 | | - ] |
49 | | - }, |
50 | | - { |
51 | | - "cell_type": "markdown", |
52 | | - "metadata": {}, |
53 | | - "source": [ |
54 | | - "# ISLR Videos\n", |
55 | | - "[All Old Videos](https://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/)" |
| 8 | + "Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani is considered a canonical text in the field of statistical/machine learning and is an absolutely fantastic way to move forward in your analytics career. [The text is free to download](http://www-bcf.usc.edu/~gareth/ISL/) and an [online course by the authors themselves](https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about) is currently available in self-pace mode, meaning you can complete it any time. Make sure to **[REGISTER FOR THE STANDFORD COURSE!](https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about)** The videos have also been [archived here on youtube](http://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/)." |
56 | 9 | ] |
57 | 10 | }, |
58 | 11 | { |
|
69 | 22 | "\n", |
70 | 23 | "The second principal component is uncorrelated with the first which makes it orthogonal to it.\n", |
71 | 24 | "\n", |
72 | | - "The first pc can also be interpretted as the line closest to the data." |
| 25 | + "The first PC can also be interpreted as the line closest to the data.\n", |
| 26 | + "\n", |
| 27 | + "Very important to scale the data first - 0 mean, 1 std. The variances won't make sense otherwise.\n", |
| 28 | + "\n", |
| 29 | + "### Proportion of variance explained\n", |
| 30 | + "Each principal component explains some of the variance of the original data. We can find the proportion that each principal component explains by dividing each components variance by the total raw variance. Summing all the variances for each component equals 1.\n", |
| 31 | + "\n", |
| 32 | + "Examine a scree plot (for an elbow) to choose the number of principal components to use. Or can use cross validation to choose.\n", |
| 33 | + "\n", |
| 34 | + "## Clustering\n", |
| 35 | + "Finding groups within data that are similar.\n", |
| 36 | + "\n", |
| 37 | + "Can cluster by using the features or by the observation (transposing data matrix)\n", |
| 38 | + "\n", |
| 39 | + "### K-Means\n", |
| 40 | + "Clustering where you define the number of clusters ahead of time. Algorithm works iteratively by first randomly choosing assigning each point to a cluster and computing cluster centers. All points are then reassigned based on euclidean distance to centroids. A new centroid is found by averaging the points in each cluster. Process stops after centroids stop moving or some max number of iterations.\n", |
| 41 | + "\n", |
| 42 | + "Can do initial assignment multiple times and choose clustering assignment with least total variance.\n", |
| 43 | + "\n", |
| 44 | + "### Hierarchical Clustering\n", |
| 45 | + "No need to pre-specify number of clusters. Most common type is bottom-up or Agglomerative. \n", |
| 46 | + "\n", |
| 47 | + "#### Interpreting Dendogram\n", |
| 48 | + "Similarity of points should be determined by the vertical axis not the horizontal axis. The lower on the dendogram that they are connected, the closer they are.\n", |
| 49 | + "\n", |
| 50 | + "Hierarchical clustering works by putting each point in its own cluster. Then each pairwise dissimilarity is computed and the least dissimilar clusters are fused. This dissimilarity is the height of the dendogram. Dissimilarity is calculated through a type of linkage and distance metric (usually euclidean).\n", |
| 51 | + "\n", |
| 52 | + "## Expectation Maximization\n", |
| 53 | + "K-means and hieracrchical cluserting are 'hard' clustering meaning that each observation belongs to exactly one cluster. There are other clustering algorithms that 'soft' cluster meaning that observations can belong to multiple clusters.\n", |
| 54 | + "\n", |
| 55 | + "One way to perform soft clustering is by assuming that each cluster is modeled by a normal (gaussian) distribution. And so the whole data set is a mixture of gaussian also called Gaussian Mixture Model.\n", |
| 56 | + "\n", |
| 57 | + "The goal here is to find the parameters for the multivariate gaussian and assign each observation a probability of being in a certain cluster.\n", |
| 58 | + "\n", |
| 59 | + "### EM algorithm\n", |
| 60 | + "In K-means, we start the algorithm by randomly assigning each point a cluster to find the first centroid. Somewhat similarly, the EM algorithm randomly assigns the parameters of gaussian distribution for each cluster. Then using bayes theorem (with initial priors as all equal), we can determine the probability of each point being a part of each cluster. This is expectation step.\n", |
| 61 | + "\n", |
| 62 | + "The maximization step is to recalculate the parameters(mean and covariance) of the multivariate gaussian distribution using the now weighted (by probabilities) observations.\n", |
| 63 | + "\n", |
| 64 | + "\n", |
| 65 | + "\n", |
| 66 | + "[Excellent video explanation of EM](https://www.youtube.com/watch?v=REypj2sy_5U)" |
73 | 67 | ] |
74 | 68 | }, |
75 | 69 | { |
|
85 | 79 | "metadata": { |
86 | 80 | "anaconda-cloud": {}, |
87 | 81 | "kernelspec": { |
88 | | - "display_name": "Python [Root]", |
| 82 | + "display_name": "Python 3", |
89 | 83 | "language": "python", |
90 | | - "name": "Python [Root]" |
| 84 | + "name": "python3" |
91 | 85 | }, |
92 | 86 | "language_info": { |
93 | 87 | "codemirror_mode": { |
|
99 | 93 | "name": "python", |
100 | 94 | "nbconvert_exporter": "python", |
101 | 95 | "pygments_lexer": "ipython3", |
102 | | - "version": "3.5.2" |
| 96 | + "version": "3.6.1" |
103 | 97 | } |
104 | 98 | }, |
105 | 99 | "nbformat": 4, |
106 | | - "nbformat_minor": 0 |
| 100 | + "nbformat_minor": 1 |
107 | 101 | } |