GitHub - Mustafa77/DAT3: General Assembly's Data Science course in Washington, DC

DAT3 Course Repository

Course materials for General Assembly's Data Science course in Washington, DC (10/2/14 - 12/18/14). View student work in the student repository.

Instructors: Josiah Davis and Kevin Markham

Course Project information

Week	Tuesday	Thursday
0		10/2: Introduction
1	10/7: Git and GitHub	10/9: Base Python
2	10/14: Getting and Cleaning Data	10/16: Exploratory Data Analysis
3	10/21: Linear Regression (Milestone: Question and Data Set)	10/23: Linear Regression Part 2
4	10/28: Machine Learning and KNN	10/30: Model Evaluation
5	11/4: Logistic Regression (Milestone: Data Exploration and Analysis Plan)	11/6: Logistic Regression Part 2, Clustering
6	11/11: Dimension Reduction	11/13: Clustering Part 2, Naive Bayes
7	11/18: NLP	11/20: Decision Trees
8	11/25: Recommenders (Milestone: First Draft Due)	Thanksgiving
9	12/2: Ensembling: Random Forests	12/4: Ensembling: Boosting
10	12/9: Review (Milestone: Second Draft Due)	12/11: Neural Networks
11	12/16: Project Presentations	12/18: Project Presentations

Class 1: Introduction

Introduction to General Assembly
Course overview and philosophy (slides)
What is data science? (slides)
Brief demo of Slack

Homework:

Install Anaconda distribution of Python 2.7, Git, and Slack
Add a photo to your Slack profile
Create a GitHub account
Read Analyzing the Analyzers (40 pages) and think about where you'd like to fit in!

Optional:

Subscribe to some data-focused newsletters, to keep current: Center for Data Innovation, O'Reilly Data Newsletter, Data Community DC
Watch Introduction to Data Science and Analysis (50 minutes) for another look at the data science workflow
Find an open source project hosted on GitHub that interests you

Class 2: Git and GitHub

Homework discussion: Any installation issues? Find any interesting GitHub projects? Any takeaways from "Analyzing the Analyzers"?
Introduce yourself: What's your technical background? Why did you join this course? How do you define success in this course?
Office hours
Git and GitHub lesson (slides)
- Create a repo on GitHub, clone it, make changes, and push up to GitHub
- Fork the DAT3-students repo, clone it, add a Markdown file (about.md) in your folder, push up to GitHub, and create a pull request

Homework:

Review the course project information, past projects from other GA students, and public data sources

Optional:

Clone this repo (DAT3) for easy access to the course files
Watch Introduction to Git and GitHub (36 minutes) to repeat a lot of today's presentation
Read the first two chapters of Pro Git for a much deeper understanding of version control and the basic Git commands
Learn some more Markdown and add it to your about.md file, then push those edits to GitHub and send another pull request
Read this friendly command line tutorial if you are brand new to the command line
For more project inspiration, browse the student projects from Andrew Ng's Machine Learning course at Stanford

Resources:

Dillinger is a browser-based Markdown editor, useful for checking your Markdown code
GitRef is an excellent reference guide for Git commands
Git quick reference for beginners is a shorter reference guide with commands grouped by workflow

Class 3: Base Python

Any questions about Git/GitHub?
Discuss the course project. What's one thing you learned from reviewing student projects?
Base Python lesson, with exercises (code)

Homework:

Complete the exercises at the end of the Python script we went over in class today and add your solutions to your folder in the DAT3-students repo
Keep thinking about your project, and consult past projects and public data sources for more inspiration

Class 4: Getting and Cleaning Data

Discuss homework solutions (code)
File input/output in Python
- Article, original data, modified data
- Open in Sublime Text
- Reading and writing files (code)
Getting data from APIs
- What is an API? Why provide one?
- Apigee: API providers, Echo Nest API console
- Echo Nest Developer Center for API key and documentation
- Three options for reading data into Python (code):
  - curl to file, view file in browser, read with json module
  - Use requests
  - Use Pyechonest
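
As a rough illustration of the file input/output and JSON-reading steps above, here is a minimal sketch using only the standard library. It is written for Python 3 (the course itself used Python 2.7), and the file names, field names, and records are made up for the example; it does not call the Echo Nest API or any other real service.

```python
import csv
import json
import os
import tempfile

# Hypothetical sample records; the field names are assumptions for illustration.
rows = [
    {"artist": "Example Artist A", "plays": 120},
    {"artist": "Example Artist B", "plays": 45},
]

workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "artists.csv")
json_path = os.path.join(workdir, "artists.json")

# Write a CSV file; the with-statement closes the file automatically.
with open(csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["artist", "plays"])
    writer.writeheader()
    writer.writerows(rows)

# Read it back; csv returns every value as a string, so convert numbers explicitly.
with open(csv_path, newline="") as f:
    from_csv = [
        {"artist": r["artist"], "plays": int(r["plays"])}
        for r in csv.DictReader(f)
    ]

# Save an API-style response to a file, then parse it with the json module.
with open(json_path, "w") as f:
    json.dump({"response": {"artists": rows}}, f)

with open(json_path) as f:
    payload = json.load(f)

# Nested JSON becomes nested dicts and lists in Python.
artist_names = [a["artist"] for a in payload["response"]["artists"]]
print(from_csv)
print(artist_names)
```

The same `json.load` step applies whether the file was written by your own code or saved from an API with curl.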

Homework:

Exercise 2 from file input/output
Read What I do when I get a new data set as told through tweets
Watch Look at Your Data (18 minutes)

Optional:

Exercise 3 from file input/output
Read this fun article about using web scraping to analyze Netflix's "micro-genres"

Resources:

Online Python Tutor is useful for visualizing (and debugging) your code
Directory of API wrappers for Python

Class 5: Exploratory Data Analysis

Discuss homework solutions (code)
Scraping the web for data
- What is web scraping? Why use it?
- Web scraping example (code):
  - Pages to scrape using Beautiful Soup 4
  - Adapted from Web scraping 101 with Python
Pandas for data analysis (code)
- Split-Apply-Combine pattern
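
To make the Split-Apply-Combine pattern concrete, here is a small sketch in plain Python rather than Pandas, so each of the three steps is visible. The records and the helper name `split_apply_combine` are hypothetical; in Pandas the same idea is usually expressed with `groupby` followed by an aggregation.

```python
from collections import defaultdict

# Hypothetical (group, value) records, made up for illustration.
records = [
    ("beer", 5.0), ("wine", 12.0), ("beer", 4.5), ("wine", 13.5), ("cider", 6.0),
]

def split_apply_combine(pairs, func):
    """Group values by key, apply func to each group, and collect the results."""
    groups = defaultdict(list)
    for key, value in pairs:          # split: one list of values per key
        groups[key].append(value)
    # apply func to each group, then combine the results into one dict
    return {key: func(values) for key, values in groups.items()}

def mean(values):
    return sum(values) / len(values)

group_means = split_apply_combine(records, mean)
group_counts = split_apply_combine(records, len)
print(group_means)
print(group_counts)
```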

Homework:

Project milestone: Submit your question and data set to DAT3-students by Tuesday!
Read through this excellent example of data wrangling and exploration in Pandas

Optional:

To learn more Pandas, read through this three-part tutorial (some overlap with today's class), or read through these two excellent (but extremely long) notebooks: Introduction to Pandas, Data Wrangling with Pandas

Resources:

For more web scraping with Beautiful Soup 4, here's a longer example: slides, code
Web scraping without writing any code: "turn any website into an API" with import.io or kimono
Simple examples of joins in Pandas, for when you need to merge multiple DataFrames together

Class 6: Linear Regression

Discuss your project question and data set
Pandas for visualization (code)
Linear regression (code, slides)
- What is linear regression?
- How to interpret the output?
- What assumptions does linear regression depend upon?
- What are multicollinearity and heteroskedasticity, and why should I care?
- How do I represent categorical variables?
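
For a feel of what a fitted linear regression looks like, here is a minimal sketch of least squares with a single predictor, plus one common way to represent a categorical variable as dummy (0/1) columns. This is not the class code; the data and the function names are made up for illustration.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) minimizing squared error for y ~ intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def r_squared(x, y, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def dummy_encode(values):
    """Encode categories as 0/1 columns, dropping the first level as the baseline.

    Dropping one level avoids perfect multicollinearity with the intercept.
    """
    levels = sorted(set(values))
    kept = levels[1:]
    return kept, [[1 if v == level else 0 for level in kept] for v in values]

# Hypothetical data for illustration.
x = [1, 2, 3, 4, 5]
y = [2.2, 4.1, 6.2, 7.9, 10.1]
intercept, slope = fit_simple_ols(x, y)
r2 = r_squared(x, y, intercept, slope)

columns, encoded = dummy_encode(["rural", "urban", "suburban", "urban"])
print(intercept, slope, r2)
print(columns, encoded)
```

The slope is read as the expected change in y for a one-unit increase in x; each dummy coefficient would be read relative to the dropped baseline level.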

Optional:

Post your favorite visualization in the "viz" channel on Slack, and tell us what you like about it!

Resources:

For more on Pandas plotting, browse through this IPython notebook or read the visualization page from the official Pandas documentation
To learn how to customize your plots further, browse through this IPython notebook on matplotlib
To explore different types of visualizations and when to use them, Choosing a Good Chart is a handy one-page reference, and here is an excellent slide deck from Columbia's Data Mining class
If you are already a master of ggplot2 in R, you may prefer "ggplot for Python" over matplotlib: introduction, tutorial

Class 7: Linear Regression Part 2

Linear regression, continued

Homework:

Complete the exercises at the end of the Python script from class

Resources:

One of the best places to go for more information about linear regression is chapter 3 of our course "textbook": An Introduction to Statistical Learning - or just read Kevin's highly abbreviated version
For more information about core assumptions, check out this article and this one
For more on log transformations, check out this article
This handout provides an overview of the computation of the F-test
This may be a helpful article on how to derive the coefficient estimates

Class 8: Machine Learning and KNN

Discuss homework solutions (code)
"Human learning" on iris data using Pandas (code)
Introduction to numpy (code)
Machine learning and K-Nearest Neighbors (slides)
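
The idea behind K-Nearest Neighbors can be sketched in a few lines: find the k training points closest to a new observation and let them vote. The version below is a from-scratch illustration, not the class code; the measurements are made-up values loosely resembling iris petal lengths and widths, and it assumes Euclidean distance.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Predict the majority label among the k training points nearest to query."""
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: euclidean(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (petal length, petal width) values, made up for illustration.
train_X = [(1.4, 0.2), (1.3, 0.2), (1.5, 0.3), (4.7, 1.4), (4.5, 1.5), (4.9, 1.5)]
train_y = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]

small_flower = knn_predict(train_X, train_y, (1.6, 0.4), k=3)
large_flower = knn_predict(train_X, train_y, (4.6, 1.3), k=3)
print(small_flower, large_flower)
```

Choosing an odd k for a two-class problem avoids tied votes; larger k gives smoother but less flexible predictions.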

Homework:

Read this excellent article, Understanding the Bias-Variance Tradeoff, and be prepared to discuss it on Thursday

Optional:

Walk through the rest of the numpy reference and see if you can understand each of the functions

Resources:

For a more thorough introduction to numpy, this guide is quite good

Class 9: Model Evaluation

Introduction to scikit-learn with iris data (code)
Discuss the article on the bias-variance tradeoff
Model evaluation procedures (slides, code)
- Training error
- Underfitting and overfitting
- Test set approach
- Cross-validation
Model evaluation metrics (slides, code)
- Confusion matrix
Introduction to Kaggle
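
As a rough sketch of two of the ideas above, the code below builds the train/test index splits used in k-fold cross-validation and tallies a confusion matrix for a binary classifier. It is a simplified stand-in for what scikit-learn provides (no shuffling, no stratification), and the labels and function names are made up for illustration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k consecutive, unshuffled folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop

def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return counts

folds = list(k_fold_indices(10, 3))

# Hypothetical true labels and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
cm = confusion_counts(y_true, y_pred)
accuracy = (cm["TP"] + cm["TN"]) / len(y_true)
sensitivity = cm["TP"] / (cm["TP"] + cm["FN"])
print(folds[0])
print(cm, accuracy, sensitivity)
```

Every observation lands in exactly one test fold, so each is used for testing once and for training k - 1 times.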

Homework:

Project milestone: Submit your "Data Exploration and Analysis Plan" to DAT3-students by Tuesday!
Read this simple example of machine learning and see if you understand everything in the article
Watch Kevin's Kaggle project presentation video (16 minutes) for a tour of the machine learning process

Optional:

For more on Kaggle, watch the video Kaggle Transforms Data Science Into Competitive Sport (28 minutes)
For much more on the Kaggle Allstate competition, read Kevin's project paper, read a brief interview with the first place team, review the Python code from the second place team, or skim the solution sharing thread
If you want to try out the Kaggle Bike Sharing Demand competition, feel free to reuse Kevin's starter code

Resources:

If you'd like to see more on today's topics, these videos from Hastie and Tibshirani are excellent: bias-variance tradeoff (10 minutes), test set (aka "validation set") approach (14 minutes), cross-validation (14 minutes) - or just read section 5.1 from their book (free PDF download!)
Kevin wrote a simple guide to confusion matrix terminology that you can use as a reference guide
The Kaggle wiki has a decent page describing other common model evaluation metrics

Class 10: Logistic Regression

Any questions from last time: model evaluation, Kaggle, article on Smart Autofill?
Summary of your feedback
Discuss your data exploration and analysis plan
Logistic Regression (slides, code)
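
One way to build intuition for logistic regression is to see how a fitted model turns a linear combination into a probability. The sketch below assumes a single predictor and uses hypothetical coefficients (not taken from the class exercise); it shows the sigmoid transformation and why the exponentiated coefficient is read as an odds ratio.

```python
import math

def sigmoid(z):
    """Map log-odds z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, intercept, coef):
    """Probability of the positive class for one predictor value x."""
    return sigmoid(intercept + coef * x)

def odds(p):
    return p / (1.0 - p)

# Hypothetical fitted coefficients, chosen only for illustration.
intercept, coef = -3.0, 0.5

p_at_6 = predict_proba(6, intercept, coef)   # log-odds are 0 here
p_at_7 = predict_proba(7, intercept, coef)

# A one-unit increase in x multiplies the odds by exp(coef).
odds_ratio = math.exp(coef)
observed_ratio = odds(p_at_7) / odds(p_at_6)
print(p_at_6, p_at_7, odds_ratio, observed_ratio)
```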

Homework:

Continue to work on Part I of the exercise from class and submit your solution to DAT3-students

Class 11: Logistic Regression Part 2, Clustering

Logistic Regression, continued (exercise solution)
Clustering (slides)
- Why cluster?
- Introduction to the K-means algorithm
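
The K-means algorithm alternates between two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. Here is a from-scratch sketch under simplifying assumptions: the first k points are used as starting centroids (real implementations typically randomize this), and the data points are made up for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    distances = [squared_distance(point, c) for c in centroids]
    return distances.index(min(distances))

def kmeans(points, k, n_iter=20):
    """Return (centroids, labels) after at most n_iter rounds of assign/update."""
    centroids = [points[i] for i in range(k)]   # simplistic initialization (assumption)
    for _ in range(n_iter):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)       # keep an empty cluster's centroid in place
        if new_centroids == centroids:          # converged: nothing moved
            break
        centroids = new_centroids
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Because the result depends on the starting centroids, it is common to run the algorithm several times with different initializations and keep the best run.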

Homework:

Read through section 8.2 on K-means Clustering from Introduction to Data Mining by next Thursday. What are some of the strengths and limitations of K-means clustering?

Resources:

If you would like a review on the topics we covered today (and Tuesday), the videos from Hastie and Tibshirani from Stanford are very good:
- Introduction to Classification (10 minutes)
- Logistic Regression and Maximum Likelihood (9 minutes)
- Multivariate Logistic Regression and Confounding Variables (10 minutes)
If you want to understand the math of how coefficients are estimated, check out these notes from CMU's Advanced Data Analysis class, written by Cosma Shalizi, one of CMU's professors.
Documentation for plotting math text
Documentation for plotting scatter plots

Class 12: Dimension Reduction

Model evaluation metrics, continued
- ROC curves and AUC (visualization, code)
- Root Mean Squared Error (slides)
Dimension Reduction (Guest Lecturer: Sinan Ozdemir)
- Slides
- Code: PCA and SVD
- Code: image compression with PCA (original source)
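
The two evaluation metrics listed at the start of this class, Root Mean Squared Error and AUC, can each be written in a few lines. The sketch below is not the class code: it computes AUC directly from its interpretation as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one (ties count as half), which is fine for small data but slower than the approach libraries use. The sample values are made up.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Squared Error for a regression model's predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical values for illustration.
error = rmse([3, 5, 7], [2, 5, 9])
area = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(error, area)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.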

Homework:

Read Paul Graham's "A Plan for Spam" in preparation for Thursday's class on Naive Bayes

Resources:

scikit-learn has extensive documentation on model evaluation
On Cross Validated, this question has dozens of explanations of PCA, and this question has a useful visualization of what is essentially PCA

Class 13: Clustering Part 2, Naive Bayes

Clustering Analysis (slides)
- Understanding the K-means algorithm
- Choosing K for K-means
- Exercise
- Visualizing data in multi-dimensional space
Naive Bayes (slides, code)
- Briefly discuss "A Plan for Spam"
- Probability and Bayes' Theorem (original source, confusion matrix reference)
- Naive Bayes classification
- Creating a spam classifier
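
To tie Bayes' Theorem to spam classification, here is a tiny Naive Bayes classifier written from scratch. It is a sketch under stated assumptions, not the class code: words are split on whitespace, word counts use add-one (Laplace) smoothing, words never seen in training are ignored, and the four training messages are invented for the example.

```python
import math
from collections import Counter

def train_nb(docs):
    """docs is a list of (text, label) pairs; returns the counts needed for prediction."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(text, class_counts, word_counts, vocab):
    """Return the label with the highest log posterior (up to a shared constant)."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Start from the log prior P(label).
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add log P(word | label) with add-one smoothing.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training messages.
training = [
    ("win cash prize now", "spam"),
    ("cheap prize win", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]
model = train_nb(training)
first = predict_nb("win a prize", *model)
second = predict_nb("agenda for meeting", *model)
print(first, second)
```

The "naive" part is the assumption that words are independent given the class, which is what lets the per-word log probabilities simply be added together.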

Resources:

For clustering, scikit-learn has documentation on K-means clustering, alternative clustering algorithms, and clustering metrics
Vipin Kumar from the University of Minnesota has a helpful chapter on clustering from his textbook: Introduction to Data Mining
For an alternative introduction to Bayes' Theorem, Bayes' Rule for Ducks, Bayes' Rule in an animated gif, and this 5-minute video on conditional probability may be helpful
For more details on Naive Bayes classification, Wikipedia has two useful articles: Naive Bayes classifier, Naive Bayes spam filtering
If you enjoyed Paul Graham's article, you can read his follow-up article on how he improved his spam filter and this related paper about state-of-the-art spam filtering in 2004
