farhangithub27
diff --git a/README.md b/README.md
Lines changed: 8 additions & 11 deletions
diff --git a/code/07_api.py b/code/07_api.py
Lines changed: 81 additions & 0 deletions
diff --git a/code/07_web_scraping.py b/code/07_web_scraping.py
Lines changed: 204 additions & 0 deletions
diff --git a/data/example.html b/data/example.html
Lines changed: 34 additions & 0 deletions
@@ -49,9 +49,7 @@ Tuesday | Thursday
 * [Feedback form](http://bit.ly/dat8feedback)
 * [Homework and project submissions](http://bit.ly/dat8homework)
 
-<!--
 ### [Comparison of machine learning models](other/model_comparison.md)
--->
 
 -----
 
@@ -149,7 +147,7 @@ Tuesday | Thursday
 ### Class 5: Visualization
 * Python homework with the Chipotle data due ([solution](code/03_python_homework_chipotle.py), [detailed explanation](http://nbviewer.ipython.org/github/justmarkham/DAT8/blob/master/notebooks/03_python_homework_chipotle_explained.ipynb))
 * Part 2 of Exploratory Data Analysis with Pandas ([code](code/04_pandas.py))
-* Visualization with Pandas and Matplotlib ([code](code/05_pandas_visualization.py))
+* Visualization with Pandas and Matplotlib ([code](code/05_pandas_visualization.py), [notebook](http://nbviewer.ipython.org/github/justmarkham/DAT8/blob/master/notebooks/05_pandas_visualization.ipynb))
 
 **Homework:**
 * Your project question write-up is due on Thursday.
@@ -177,12 +175,11 @@ Tuesday | Thursday
 ### Class 6: Machine Learning
 * Part 2 of Visualization with Pandas and Matplotlib ([code](code/05_pandas_visualization.py))
 * Brief introduction to the Jupyter/IPython Notebook
-* Human learning exercise:
+* "Human learning" exercise:
     * [Iris dataset](http://archive.ics.uci.edu/ml/datasets/Iris) hosted by the UCI Machine Learning Repository
     * [Iris photo](http://sebastianraschka.com/Images/2014_python_lda/iris_petal_sepal.png)
     * [Notebook](http://nbviewer.ipython.org/github/justmarkham/DAT8/blob/master/notebooks/06_human_learning_iris.ipynb)
 * Introduction to machine learning ([slides](slides/06_machine_learning.pdf))
-* Machine learning exercise ([article](http://blog.dominodatalab.com/10-interesting-uses-of-data-science/))
 
 **Homework:**
 * **Optional:** Complete the bonus exercise listed in the [human learning notebook](http://nbviewer.ipython.org/github/justmarkham/DAT8/blob/master/notebooks/06_human_learning_iris.ipynb). It will take the place of any one homework you miss, past or future! This is due on Tuesday (9/8).
@@ -203,12 +200,11 @@ Tuesday | Thursday
 * If you would like to learn the IPython Notebook, the official [Notebook tutorials](http://nbviewer.ipython.org/github/ipython/ipython/blob/master/examples/Notebook/Index.ipynb) are useful.
 * This [Reddit discussion](https://www.reddit.com/r/Python/comments/3be5z2/do_you_prefer_ipython_notebook_over_ipython/) compares the relative strengths of the IPython Notebook and Spyder.
 
-<!--
-
 -----
 
 ### Class 7: Getting Data
 * Pandas homework with the IMDb data due (solution)
+* Optional "human learning" exercise with the iris data due (solution)
 * APIs ([code](code/07_api.py))
     * [OMDb API](http://www.omdbapi.com/)
 * Web scraping ([code](code/07_web_scraping.py))
@@ -239,9 +235,9 @@ Tuesday | Thursday
 -----
 
 ### Class 8: K-Nearest Neighbors
-* K-nearest neighbors and scikit-learn ([notebook](http://nbviewer.ipython.org/github/justmarkham/DAT8/blob/master/notebooks/08_knn_sklearn.ipynb), notebook code)
-* Exercise with NBA player data ([data](https://github.com/justmarkham/DAT4-students/blob/master/kerry/Final/NBA_players_2015.csv), [data dictionary](https://github.com/justmarkham/DAT-project-examples/blob/master/pdf/nba_paper.pdf), [notebook](http://nbviewer.ipython.org/github/justmarkham/DAT8/blob/master/notebooks/08_nba_knn.ipynb), notebook code)
-* Exploring the bias-variance tradeoff ([notebook](http://nbviewer.ipython.org/github/justmarkham/DAT8/blob/master/notebooks/08_bias_variance.ipynb), notebook code)
+* K-nearest neighbors and scikit-learn ([notebook](http://nbviewer.ipython.org/github/justmarkham/DAT8/blob/master/notebooks/08_knn_sklearn.ipynb))
+* Exercise with NBA player data ([notebook](http://nbviewer.ipython.org/github/justmarkham/DAT8/blob/master/notebooks/08_nba_knn.ipynb), [data](https://github.com/justmarkham/DAT4-students/blob/master/kerry/Final/NBA_players_2015.csv), [data dictionary](https://github.com/justmarkham/DAT-project-examples/blob/master/pdf/nba_paper.pdf))
+* Exploring the bias-variance tradeoff ([notebook](http://nbviewer.ipython.org/github/justmarkham/DAT8/blob/master/notebooks/08_bias_variance.ipynb))
 
 **Homework:**
 * Reading assignment on the [bias-variance tradeoff](homework/09_bias_variance.md)
@@ -261,6 +257,7 @@ Tuesday | Thursday
 * [Data visualization with Seaborn](https://beta.oreilly.com/learning/data-visualization-with-seaborn) is a quick tour of some of the popular types of Seaborn plots.
 * [Visualizing Google Forms Data with Seaborn](http://pbpython.com/pandas-google-forms-part2.html) and [How to Create NBA Shot Charts in Python](http://savvastjortjoglou.com/nba-shot-sharts.html) are both good examples of Seaborn usage on real-world data.
 
+<!--
 
 -----
 
@@ -270,7 +267,7 @@ Tuesday | Thursday
     * Discuss assigned readings: [introduction](http://www.dataschool.io/reproducibility-is-not-just-for-researchers/), [Colbert Report video](http://thecolbertreport.cc.com/videos/dcyvro/austerity-s-spreadsheet-error), [cabs article](http://iquantny.tumblr.com/post/107245431809/how-software-in-half-of-nyc-cabs-generates-5-2), [Tweet](https://twitter.com/jakevdp/status/519563939177197571), [creating a reproducible analysis](https://github.com/jtleek/datasharing)
     * Examples: [Classic rock](https://github.com/fivethirtyeight/data/tree/master/classic-rock), [student project 1](https://github.com/jwknobloch/DAT4_final_project), [student project 2](https://github.com/justmarkham/DAT4-students/tree/master/Jonathan_Bryan/Project_Files)
 * Discuss the reading assignment on the [bias-variance tradeoff](homework/09_bias_variance.md)
-* Model evaluation using train/test split ([notebook](http://nbviewer.ipython.org/github/justmarkham/DAT8/blob/master/notebooks/09_model_evaluation.ipynb), notebook code)
+* Model evaluation using train/test split ([notebook](http://nbviewer.ipython.org/github/justmarkham/DAT8/blob/master/notebooks/09_model_evaluation.ipynb))
 
 **Homework:**
 * If you're brand new to linear regression, read [What is Linear Regression?](http://blog.yhathq.com/posts/what-is-linear-regression.html) and watch [The Easiest Introduction to Regression Analysis](https://www.youtube.com/watch?v=k_OB1tWX9PM) (14 minutes).
 
@@ -0,0 +1,81 @@
+'''
+CLASS: Getting Data from APIs
+
+What is an API?
+- Application Programming Interface
+- Structured way to expose specific functionality and data access to users
+- Web APIs usually follow the "REST" architectural style
+
+How to interact with a REST API:
+- Make a "request" to a specific URL (an "endpoint"), and get the data back in a "response"
+- Most relevant request method for us is GET (other methods: POST, PUT, DELETE)
+- Response is often in JSON format
+- Web console is sometimes available (allows you to explore an API)
+'''
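Review note: the request/response cycle described in the docstring above can be sketched offline as follows. This is a minimal illustration, assuming the OMDb query parameters `t` (title), `r` (response format) and `type`; the helper name `build_request_url` is hypothetical and the response body is a made-up sample, not real API output.

```python
import json

try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

BASE_URL = 'http://www.omdbapi.com/'     # the endpoint linked from the README


def build_request_url(title):
    """Build the URL for a GET request (hypothetical helper)."""
    # a list of pairs keeps the parameter order predictable
    params = [('t', title), ('r', 'json'), ('type', 'movie')]
    return BASE_URL + '?' + urlencode(params)


url = build_request_url('the shawshank redemption')
print(url)

# made-up sample of a JSON response body (assumed field names)
sample_body = '{"Title": "The Shawshank Redemption", "Year": "1994", "Response": "True"}'

# decode the JSON text into a dictionary and read one field
movie = json.loads(sample_body)
print(movie['Year'])
```

In class the URL would be passed to `requests.get`, and the decoded dictionary would come from the response rather than from a hard-coded string.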
+
+# read IMDb data into a DataFrame: we want a year column!
+
+# use requests library to interact with a URL
+
+# check the status: 200 means success, 4xx means client error, 5xx means server error
+
+# view the raw response text
+
+# decode the JSON response body into a dictionary
+
+# extract the year from the dictionary
+
+# what happens if the movie name is not recognized?
+
+# define a function to return the year
+
+# test the function
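Review note: one possible shape for the function these stubs describe, shown as a sketch rather than the class solution. It assumes the response carries `Response` and `Year` fields; the `fetch` argument and `fake_fetch` are hypothetical stand-ins so the example runs without network access.

```python
import json


def get_movie_year(title, fetch):
    """Return the release year for a title as an int, or None if not found.

    `fetch` is any callable that takes a title and returns the JSON response
    text; in class it would wrap an HTTP GET request to the API.
    """
    info = json.loads(fetch(title))
    if info.get('Response') == 'True':
        return int(info['Year'])
    return None


def fake_fetch(title):
    """Hypothetical stand-in for the API: it only knows one movie."""
    if title.lower() == 'the shawshank redemption':
        return '{"Title": "The Shawshank Redemption", "Year": "1994", "Response": "True"}'
    return '{"Response": "False", "Error": "Movie not found!"}'


found = get_movie_year('The Shawshank Redemption', fake_fetch)
missing = get_movie_year('blahblahblah', fake_fetch)
print(found)
print(missing)
```

Returning `None` for an unrecognized title keeps the later loop simple, because every title still produces exactly one list element.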
+
+# create a smaller DataFrame for testing
+
+# write a for loop to build a list of years
+
+# check that the DataFrame and the list of years are the same length
+
+# save that list as a new column
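Review note: a sketch of the loop these stubs describe, using plain lists so it runs without pandas. `lookup_year` is a hypothetical stand-in for the API call, and `delay` is an assumed parameter for pausing between calls so a real API's rate limit is respected.

```python
from time import sleep


def lookup_year(title):
    """Hypothetical stand-in for an API call; returns None for unknown titles."""
    known = {'The Shawshank Redemption': 1994, 'The Godfather': 1972}
    return known.get(title)


def build_year_list(titles, lookup, delay=1.0):
    """Call lookup once per title, pausing between calls."""
    years = []
    for title in titles:
        years.append(lookup(title))
        sleep(delay)
    return years


titles = ['The Shawshank Redemption', 'The Godfather', 'Not A Real Movie']
years = build_year_list(titles, lookup_year, delay=0)  # no pause needed for the stand-in

# one result per title, so the list can safely become a new column
print(len(years) == len(titles))
print(years)
```

With a DataFrame, the resulting list could then be assigned as a column, provided its length matches the number of rows.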
+
+'''
+Bonus content: Updating the DataFrame as part of a loop
+'''
+
+# enumerate allows you to access the item location while iterating
+letters = ['a', 'b', 'c']
+for index, letter in enumerate(letters):
+    print index, letter
+
+# iterrows method for DataFrames is similar
+for index, row in top_movies.iterrows():
+    print index, row.title
+
+# create a new column and set a default value
+movies['year'] = -1
+
+# loc method allows you to access a DataFrame element by 'label'
+movies.loc[0, 'year'] = 1994
+
+# write a for loop to update the year for the first three movies
+for index, row in movies.iterrows():
+    if index < 3:
+        movies.loc[index, 'year'] = get_movie_year(row.title)
+        sleep(1)
+    else:
+        break
+
+'''
+Other considerations when accessing APIs:
+- Most APIs require you to have an access key (which you should store outside your code)
+- Most APIs limit the number of API calls you can make (per day, hour, minute, etc.)
+- Not all APIs are free
+- Not all APIs are well-documented
+- Pay attention to the API version
+
+Python wrapper is another option for accessing an API:
+- Set of functions that "wrap" the API code for ease of use
+- Potentially simplifies your code
+- But a wrapper could have bugs, be out of date, or be poorly documented
+'''
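Review note: the advice above to store the access key outside your code can be sketched with an environment variable. The variable name `MY_API_KEY` and the helper `load_api_key` are hypothetical; the key is set inside the script here only so the example is self-contained.

```python
import os


def load_api_key(env_var='MY_API_KEY'):
    """Read an API key from the environment instead of hard-coding it."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError('Set the %s environment variable first' % env_var)
    return key


# for demonstration only: normally the variable is set in the shell, not in code
os.environ['MY_API_KEY'] = 'not-a-real-key'

api_key = load_api_key()
print(api_key)
```

Keeping the key in the environment means the script can be committed to a public repository without exposing it.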
@@ -0,0 +1,204 @@
+'''
+CLASS: Web Scraping with Beautiful Soup
+
+What is web scraping?
+- Extracting information from websites (simulates a human copying and pasting)
+- Based on finding patterns in website code (usually HTML)
+
+What are best practices for web scraping?
+- Scraping too many pages too fast can get your IP address blocked
+- Pay attention to the robots exclusion standard (robots.txt)
+- Let's look at http://www.imdb.com/robots.txt
+
+What is HTML?
+- Code interpreted by a web browser to produce ("render") a web page
+- Let's look at example.html
+- Tags are opened and closed
+- Tags have optional attributes
+
+How to view HTML code:
+- To view the entire page: "View Source" or "View Page Source" or "Show Page Source"
+- To view a specific part: "Inspect Element"
+- Safari users: Safari menu, Preferences, Advanced, Show Develop menu in menu bar
+- Let's inspect example.html
+'''
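Review note: the robots.txt best practice listed above can be checked programmatically with the standard library's robot parser. This is a minimal offline sketch; the rules and the `example.com` URLs are made up for illustration and are not IMDb's actual robots.txt.

```python
try:
    from urllib import robotparser   # Python 3
except ImportError:
    import robotparser               # Python 2

# made-up rules: every crawler is barred from /private/
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)

# ask whether a generic crawler may fetch each URL
allowed = parser.can_fetch('*', 'http://example.com/title/tt0111161/')
blocked = parser.can_fetch('*', 'http://example.com/private/page.html')
print(allowed)
print(blocked)
```

For a live site, `set_url` followed by `read` would load the real file instead of parsing a hard-coded string.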
+
+# read the HTML code for a web page and save as a string
+with open('example.html', 'rU') as f:
+    html = f.read()
+
+# convert HTML into a structured Soup object
+from bs4 import BeautifulSoup
+b = BeautifulSoup(html)
+
+# print out the object
+print b
+print b.prettify()
+
+# 'find' method returns the first matching Tag (and everything inside of it)
+b.find(name='body')
+b.find(name='h1')
+
+# Tags allow you to access the 'inside text'
+b.find(name='h1').text
+
+# Tags also allow you to access their attributes
+b.find(name='h1')['id']
+
+# 'find_all' method is useful for finding all matching Tags
+b.find(name='p')        # returns a Tag
+b.find_all(name='p')    # returns a ResultSet (like a list of Tags)
+
+# ResultSets can be indexed and sliced like lists
+len(b.find_all(name='p'))
+b.find_all(name='p')[0]
+b.find_all(name='p')[0].text
+b.find_all(name='p')[0]['id']
+
+# iterate over a ResultSet
+results = b.find_all(name='p')
+for tag in results:
+    print tag.text
+
+# limit search by Tag attribute
+b.find(name='p', attrs={'id':'scraping'})
+b.find_all(name='p', attrs={'class':'topic'})
+
+# limit search to specific sections
+b.find_all(name='li')
+b.find(name='ul', attrs={'id':'scraping'}).find_all(name='li')
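Review note: to show what limiting a search to a specific section means without depending on Beautiful Soup, here is a sketch of the same idea using the standard library's `HTMLParser` instead. The class name `ListItemCollector` is hypothetical, and the HTML is a shortened copy of the two lists in example.html.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2


class ListItemCollector(HTMLParser):
    """Collect the text of <li> tags inside the <ul> with a given id."""

    def __init__(self, ul_id):
        HTMLParser.__init__(self)
        self.ul_id = ul_id
        self.in_target_ul = False
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == 'ul' and dict(attrs).get('id') == self.ul_id:
            self.in_target_ul = True
        elif tag == 'li' and self.in_target_ul:
            self.in_li = True
            self.items.append('')

    def handle_endtag(self, tag):
        if tag == 'ul':
            self.in_target_ul = False
        elif tag == 'li':
            self.in_li = False

    def handle_data(self, data):
        # only keep text that appears inside a matching <li>
        if self.in_li:
            self.items[-1] += data


html = """
<ul id='api'><li>API resource 1</li><li>API resource 2</li></ul>
<ul id='scraping'><li>Web scraping resource 1</li><li>Web scraping resource 2</li></ul>
"""

collector = ListItemCollector('scraping')
collector.feed(html)
print(collector.items)
```

The Beautiful Soup one-liner above does the same narrowing in a single chained call, which is the main reason to prefer it for scraping work.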
+
+'''
+EXERCISE ONE
+'''
+
+# find the 'h2' tag and then print its text
+
+# find the 'p' tag with an 'id' value of 'feedback' and then print its text
+
+# find the first 'p' tag and then print the value of the 'id' attribute
+
+# print the text of all four resources
+
+# print the text of only the API resources
+
+'''
+Scraping the IMDb website
+'''
+
+# get the HTML from the Shawshank Redemption page
+
+# convert HTML into Soup
+
+# run this code if you have encoding errors
+
+# get the title
+
+# get the star rating
+
+'''
+EXERCISE TWO
+'''
+
+# get the description
+
+# get the content rating
+
+# get the duration in minutes (as an integer)
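Review note: once the duration text has been scraped, converting it to an integer could look like the sketch below. It assumes the scraped text resembles `'142 min'` surrounded by whitespace; `parse_minutes` is a hypothetical helper, and a format such as `'2h 22min'` would need different handling.

```python
import re


def parse_minutes(text):
    """Return the first run of digits in text as an int, or None if absent."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None


minutes = parse_minutes('\n    142 min\n')
print(minutes)
```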
+
+'''
+OPTIONAL WEB SCRAPING HOMEWORK
+
+First, define a function that accepts an IMDb ID and returns a dictionary of
+movie information: title, star_rating, description, content_rating, duration.
+The function should gather this information by scraping the IMDb website, not
+by calling the OMDb API. (This is really just a wrapper of the web scraping
+code we wrote above.)
+
+For example, get_movie_info('tt0111161') should return:
+
+{'content_rating': 'R',
+ 'description': u'Two imprisoned men bond over a number of years...',
+ 'duration': 142,
+ 'star_rating': 9.3,
+ 'title': u'The Shawshank Redemption'}
+
+Then, open the file imdb_ids.txt using Python, and write a for loop that builds
+a list in which each element is a dictionary of movie information.
+
+Finally, convert that list into a DataFrame.
+'''
+
+import requests  # needed by the IMDb examples below
+
+'''
+Another IMDb example: Getting the genres
+'''
+
+# read the Shawshank Redemption page again
+r = requests.get('http://www.imdb.com/title/tt0111161/')
+b = BeautifulSoup(r.text)
+
+# only gets the first genre
+b.find(name='span', attrs={'class':'itemprop', 'itemprop':'genre'})
+
+# gets all of the genres
+b.find_all(name='span', attrs={'class':'itemprop', 'itemprop':'genre'})
+
+# stores the genres in a list
+[tag.text for tag in b.find_all(name='span', attrs={'class':'itemprop', 'itemprop':'genre'})]
+
+'''
+Another IMDb example: Getting the writers
+'''
+
+# attempt to get the list of writers (too many results)
+b.find_all(name='span', attrs={'itemprop':'name'})
+
+# limit search to a smaller section to only get the writers
+b.find(name='div', attrs={'itemprop':'creator'}).find_all(name='span', attrs={'itemprop':'name'})
+
+'''
+Another IMDb example: Getting the URLs of cast images
+'''
+
+# find the images by size
+results = b.find_all(name='img', attrs={'height':'44', 'width':'32'})
+
+# check that the number of results matches the number of cast images on the page
+len(results)
+
+# iterate over the results to get all URLs
+for tag in results:
+    print tag['loadlate']
+
+'''
+Useful to know: Alternative Beautiful Soup syntax
+'''
+
+# read the example web page again
+with open('example.html', 'rU') as f:
+    html = f.read()
+
+# convert to Soup
+b = BeautifulSoup(html)
+
+# these are equivalent
+b.find(name='p')    # normal way
+b.find('p')         # 'name' is the first argument
+b.p                 # can also be accessed as an attribute of the object
+
+# these are equivalent
+b.find(name='p', attrs={'id':'scraping'})   # normal way
+b.find('p', {'id':'scraping'})              # 'name' and 'attrs' are the first two arguments
+b.find('p', id='scraping')                  # can write the attributes as arguments
+
+# these are equivalent
+b.find(name='p', attrs={'class':'topic'})   # normal way
+b.find('p', class_='topic')                 # 'class' is special, so it needs a trailing underscore
+b.find('p', 'topic')                        # if you don't name it, it's assumed to be the class
+
+# these are equivalent
+b.find_all(name='p')    # normal way
+b.findAll(name='p')     # old function name from Beautiful Soup 3
+b('p')                  # if you don't name the method, it's assumed to be find_all
@@ -0,0 +1,34 @@
+<!DOCTYPE html>
+<html lang='en'>
+
+<head>
+    <title>Example Web Page</title>
+</head>
+
+<body>
+
+    <h1 id='main'>DAT8 Class 7</h1>
+
+    <p class='topic' id='api'>First, we are covering APIs, which are useful for getting data.</p>
+    <p class='topic' id='scraping'>Then, we are covering web scraping, which is a more flexible way to get data.</p>
+    <p class='topic' id='feedback'>Finally, I will ask you to fill out yet another feedback form!</p>
+
+    <h2>Resource List</h2>
+
+    <p>Here are some helpful API resources:</p>
+
+    <ul id='api'>
+        <li>API resource 1</li>
+        <li>API resource 2</li>
+    </ul>
+
+    <p>Here are some helpful web scraping resources:</p>
+
+    <ul id='scraping'>
+        <li>Web scraping resource 1</li>
+        <li>Web scraping resource 2</li>
+    </ul>
+
+</body>
+
+</html>