Mustafa77/DAT3

DAT3 Course Repository

Course materials for General Assembly's Data Science course in Washington, DC (10/2/14 - 12/18/14). View student work in the student repository.

Instructors: Josiah Davis and Kevin Markham

Course Project information

| Week | Tuesday | Thursday |
| --- | --- | --- |
| 0 | | 10/2: Introduction |
| 1 | 10/7: Git and GitHub | 10/9: Base Python |
| 2 | 10/14: Getting and Cleaning Data | 10/16: Exploratory Data Analysis |
| 3 | 10/21: Linear Regression (Milestone: Question and Data Set) | 10/23: Linear Regression Part 2 |
| 4 | 10/28: Machine Learning and KNN | 10/30: Model Evaluation |
| 5 | 11/4: Logistic Regression (Milestone: Data Exploration and Analysis Plan) | 11/6: Logistic Regression Part 2, Clustering |
| 6 | 11/11: Dimension Reduction | 11/13: Clustering Part 2, Naive Bayes |
| 7 | 11/18: NLP | 11/20: Decision Trees |
| 8 | 11/25: Recommenders (Milestone: First Draft Due) | Thanksgiving |
| 9 | 12/2: Ensembling: Random Forests | 12/4: Ensembling: Boosting |
| 10 | 12/9: Review (Milestone: Second Draft Due) | 12/11: Neural Networks |
| 11 | 12/16: Project Presentations | 12/18: Project Presentations |

Class 1: Introduction

  • Introduction to General Assembly
  • Course overview and philosophy (slides)
  • What is data science? (slides)
  • Brief demo of Slack

Homework:

Optional:

Class 2: Git and GitHub

  • Homework discussion: Any installation issues? Find any interesting GitHub projects? Any takeaways from "Analyzing the Analyzers"?
  • Introduce yourself: What's your technical background? Why did you join this course? How do you define success in this course?
  • Office hours
  • Git and GitHub lesson (slides)
    • Create a repo on GitHub, clone it, make changes, and push up to GitHub
    • Fork the DAT3-students repo, clone it, add a Markdown file (about.md) in your folder, push up to GitHub, and create a pull request

Homework:

Optional:

  • Clone this repo (DAT3) for easy access to the course files
  • Watch Introduction to Git and GitHub (36 minutes) to review much of today's presentation
  • Read the first two chapters of Pro Git for a much deeper understanding of version control and the basic Git commands
  • Learn some more Markdown and add it to your about.md file, then push those edits to GitHub and send another pull request
  • Read this friendly command line tutorial if you are brand new to the command line
  • For more project inspiration, browse the student projects from Andrew Ng's Machine Learning course at Stanford

Resources:

Class 3: Base Python

  • Any questions about Git/GitHub?
  • Discuss the course project. What's one thing you learned from reviewing student projects?
  • Base Python lesson, with exercises (code)

Homework:

Class 4: Getting and Cleaning Data

Homework:

Optional:

Resources:

Class 5: Exploratory Data Analysis

Homework:

Optional:

Resources:

  • For more web scraping with Beautiful Soup 4, here's a longer example: slides, code
  • Web scraping without writing any code: "turn any website into an API" with import.io or kimono
  • Simple examples of joins in Pandas, for when you need to merge multiple DataFrames together
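As a minimal sketch of the joins mentioned above, the following merges two small DataFrames with `pd.merge`. The tables, column names, and values are hypothetical and do not come from the course materials.

```python
import pandas as pd

# Hypothetical example data: neither table comes from the course materials.
students = pd.DataFrame({
    "student_id": [1, 2, 3],
    "name": ["Ana", "Ben", "Cy"],
})
projects = pd.DataFrame({
    "student_id": [1, 2, 4],
    "topic": ["spam filter", "bike share", "recommender"],
})

# Inner join: keep only student_id values present in both DataFrames.
inner = pd.merge(students, projects, on="student_id", how="inner")

# Left join: keep every student; topic is NaN where no project matches.
left = pd.merge(students, projects, on="student_id", how="left")

# Outer join: keep every student_id that appears on either side.
outer = pd.merge(students, projects, on="student_id", how="outer")
```

The `how` argument is the main design choice: it decides which unmatched rows survive the merge.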

Class 6: Linear Regression

  • Discuss your project question and data set
  • Pandas for visualization (code)
  • Linear regression (code, slides)
    • What is linear regression?
    • How do I interpret the output?
    • What assumptions does linear regression depend on?
    • What are multicollinearity and heteroskedasticity, and why should I care?
    • How do I represent categorical variables?
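One common answer to the categorical-variable question above is dummy (indicator) encoding. The sketch below is not the class code: it assumes a made-up data set in which `price` depends on `size` and `city`, encodes `city` with `pd.get_dummies`, and fits ordinary least squares with `np.linalg.lstsq`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: price depends on size plus a fixed offset per city.
df = pd.DataFrame({
    "size": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "city": ["DC", "DC", "DC", "NYC", "NYC", "NYC"],
})
df["price"] = 10 + 2 * df["size"] + 5 * (df["city"] == "NYC")

# One dummy column per category, dropping the first ("DC") as the baseline
# so the dummies are not perfectly collinear with the intercept.
dummies = pd.get_dummies(df["city"], prefix="city", drop_first=True, dtype=int)

# Design matrix: intercept, size, and the city_NYC indicator.
X = np.column_stack([np.ones(len(df)), df["size"], dummies["city_NYC"]])
y = df["price"].to_numpy(dtype=float)

# Ordinary least squares via numpy's least-squares solver.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope_size, effect_nyc = coef
```

Dropping one category also ties back to multicollinearity: keeping a dummy for every city alongside an intercept would make the columns of `X` linearly dependent. The coefficient on `city_NYC` is then read as the difference from the baseline city.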

Optional:

  • Post your favorite visualization in the "viz" channel on Slack, and tell us what you like about it!

Resources:

  • For more on Pandas plotting, browse through this IPython notebook or read the visualization page from the official Pandas documentation
  • To learn how to customize your plots further, browse through this IPython notebook on matplotlib
  • To explore different types of visualizations and when to use them, Choosing a Good Chart is a handy one-page reference, and here is an excellent slide deck from Columbia's Data Mining class
  • If you are already a master of ggplot2 in R, you may prefer "ggplot for Python" over matplotlib: introduction, tutorial

Class 7: Linear Regression Part 2

  • Linear regression, continued

Homework:

  • Complete the exercises at the end of the Python script from class

Resources:

Class 8: Machine Learning and KNN

  • Discuss homework solutions (code)
  • "Human learning" on iris data using Pandas (code)
  • Introduction to numpy (code)
  • Machine learning and K-Nearest Neighbors (slides)
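To make the K-Nearest Neighbors idea concrete, here is a minimal from-scratch sketch in plain Python. It is an illustration under stated assumptions, not the class implementation: the function name `knn_predict` and the measurements are hypothetical, and the data only loosely resembles the iris set.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(x, query), label)
        for x, label in zip(train_X, train_y)
    ]
    # Sort by distance only and keep the k closest.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical measurements (petal length, petal width), not the real iris data.
train_X = [(1.4, 0.2), (1.5, 0.3), (1.3, 0.2), (4.7, 1.4), (4.5, 1.5), (4.9, 1.5)]
train_y = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]

prediction = knn_predict(train_X, train_y, query=(1.6, 0.4), k=3)
```

With an even `k` or more than two classes, votes can tie; this sketch resolves ties in favor of the label encountered first among the nearest neighbors, which is one choice among several.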

Homework:

Optional:

  • Walk through the rest of the numpy reference and see if you can understand each of the functions

Resources:

  • For a more thorough introduction to numpy, this guide is quite good

Class 9: Model Evaluation

  • Introduction to scikit-learn with iris data (code)
  • Discuss the article on the bias-variance tradeoff
  • Model evaluation procedures (slides, code)
    • Training error
    • Underfitting and overfitting
    • Test set approach
    • Cross-validation
  • Model evaluation metrics (slides, code)
    • Confusion matrix
  • Introduction to Kaggle
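The evaluation topics above (cross-validation and the confusion matrix) can be sketched in plain Python as follows. The helper names `k_fold_indices`, `confusion_matrix`, and `accuracy` are hypothetical, as are the labels and predictions; in practice scikit-learn provides equivalents, and this sketch only shows the bookkeeping involved.

```python
from collections import Counter

def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    for fold in range(n_folds):
        test = indices[fold::n_folds]  # every n_folds-th sample, offset by fold
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test

def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "FP", "FN", "TN")}

def accuracy(cm):
    """Fraction of predictions that match the actual labels."""
    return (cm["TP"] + cm["TN"]) / sum(cm.values())

# Hypothetical labels and predictions, for illustration only.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
cm = confusion_matrix(actual, predicted)
folds = list(k_fold_indices(n_samples=8, n_folds=4))
```

Note that this splitter does not shuffle, so data sorted by class should be shuffled first. In a full cross-validation loop, a model would be fit on each `train` split and scored on the matching `test` split, and the scores averaged.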

Homework:

Optional:

Resources:

Class 10: Logistic Regression

  • Any questions from last time: model evaluation, Kaggle, article on Smart Autofill?
  • Summary of your feedback
  • Discuss your data exploration and analysis plan
  • Logistic Regression (slides, code)

Homework:

  • Continue to work on Part I of the exercise from class and submit your solution to DAT3-students

Class 11: Logistic Regression Part 2, Clustering

  • Logistic Regression, continued (exercise solution)
  • Clustering (slides)
    • Why cluster?
    • Introduction to the K-means algorithm
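As a rough sketch of the K-means algorithm introduced above, the following implements the standard assign-then-update loop (Lloyd's algorithm) in plain Python. The function name `k_means`, the sample points, and the choice to pass initial centroids explicitly (rather than pick them at random) are assumptions made to keep the example small and deterministic.

```python
import math

def k_means(points, centroids, n_iterations=10):
    """Lloyd's algorithm: alternate between assigning points and updating centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(n_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
```

The result depends on the starting centroids, which is one of the limitations of K-means: different initializations can converge to different clusterings.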

Homework:

  • Read through section 8.2 on K-means Clustering from Introduction to Data Mining by next Thursday. What are some of the strengths and limitations of k-means clustering?

Resources:

Class 12: Dimension Reduction

Homework:

  • Read Paul Graham's "A Plan for Spam" in preparation for Thursday's class on Naive Bayes

Resources:

Class 13: Clustering Part 2, Naive Bayes

Resources:
