Commit 855edb7

add classes 7 and 8
1 parent 5796bb6 commit 855edb7

15 files changed

Lines changed: 1383 additions & 11 deletions

README.md

Lines changed: 8 additions & 11 deletions
@@ -49,9 +49,7 @@ Tuesday | Thursday
 * [Feedback form](http://bit.ly/dat8feedback)
 * [Homework and project submissions](http://bit.ly/dat8homework)
 
-<!--
 ### [Comparison of machine learning models](other/model_comparison.md)
--->
 
 -----
 
@@ -149,7 +147,7 @@ Tuesday | Thursday
 ### Class 5: Visualization
 * Python homework with the Chipotle data due ([solution](code/03_python_homework_chipotle.py), [detailed explanation](http://nbviewer.ipython.org/github/justmarkham/DAT8/blob/master/notebooks/03_python_homework_chipotle_explained.ipynb))
 * Part 2 of Exploratory Data Analysis with Pandas ([code](code/04_pandas.py))
-* Visualization with Pandas and Matplotlib ([code](code/05_pandas_visualization.py))
+* Visualization with Pandas and Matplotlib ([code](code/05_pandas_visualization.py), [notebook](http://nbviewer.ipython.org/github/justmarkham/DAT8/blob/master/notebooks/05_pandas_visualization.ipynb))
 
 **Homework:**
 * Your project question write-up is due on Thursday.
@@ -177,12 +175,11 @@ Tuesday | Thursday
 ### Class 6: Machine Learning
 * Part 2 of Visualization with Pandas and Matplotlib ([code](code/05_pandas_visualization.py))
 * Brief introduction to the Jupyter/IPython Notebook
-* Human learning exercise:
+* "Human learning" exercise:
 * [Iris dataset](http://archive.ics.uci.edu/ml/datasets/Iris) hosted by the UCI Machine Learning Repository
 * [Iris photo](http://sebastianraschka.com/Images/2014_python_lda/iris_petal_sepal.png)
 * [Notebook](http://nbviewer.ipython.org/github/justmarkham/DAT8/blob/master/notebooks/06_human_learning_iris.ipynb)
 * Introduction to machine learning ([slides](slides/06_machine_learning.pdf))
-* Machine learning exercise ([article](http://blog.dominodatalab.com/10-interesting-uses-of-data-science/))
 
 **Homework:**
 * **Optional:** Complete the bonus exercise listed in the [human learning notebook](http://nbviewer.ipython.org/github/justmarkham/DAT8/blob/master/notebooks/06_human_learning_iris.ipynb). It will take the place of any one homework you miss, past or future! This is due on Tuesday (9/8).
@@ -203,12 +200,11 @@ Tuesday | Thursday
 * If you would like to learn the IPython Notebook, the official [Notebook tutorials](http://nbviewer.ipython.org/github/ipython/ipython/blob/master/examples/Notebook/Index.ipynb) are useful.
 * This [Reddit discussion](https://www.reddit.com/r/Python/comments/3be5z2/do_you_prefer_ipython_notebook_over_ipython/) compares the relative strengths of the IPython Notebook and Spyder.
 
-<!--
-
 -----
 
 ### Class 7: Getting Data
 * Pandas homework with the IMDb data due (solution)
+* Optional "human learning" exercise with the iris data due (solution)
 * APIs ([code](code/07_api.py))
 * [OMDb API](http://www.omdbapi.com/)
 * Web scraping ([code](code/07_web_scraping.py))
@@ -239,9 +235,9 @@ Tuesday | Thursday
 -----
 
 ### Class 8: K-Nearest Neighbors
-* K-nearest neighbors and scikit-learn ([notebook](http://nbviewer.ipython.org/github/justmarkham/DAT8/blob/master/notebooks/08_knn_sklearn.ipynb), notebook code)
-* Exercise with NBA player data ([data](https://github.com/justmarkham/DAT4-students/blob/master/kerry/Final/NBA_players_2015.csv), [data dictionary](https://github.com/justmarkham/DAT-project-examples/blob/master/pdf/nba_paper.pdf), [notebook](http://nbviewer.ipython.org/github/justmarkham/DAT8/blob/master/notebooks/08_nba_knn.ipynb), notebook code)
-* Exploring the bias-variance tradeoff ([notebook](http://nbviewer.ipython.org/github/justmarkham/DAT8/blob/master/notebooks/08_bias_variance.ipynb), notebook code)
+* K-nearest neighbors and scikit-learn ([notebook](http://nbviewer.ipython.org/github/justmarkham/DAT8/blob/master/notebooks/08_knn_sklearn.ipynb))
+* Exercise with NBA player data ([notebook](http://nbviewer.ipython.org/github/justmarkham/DAT8/blob/master/notebooks/08_nba_knn.ipynb), [data](https://github.com/justmarkham/DAT4-students/blob/master/kerry/Final/NBA_players_2015.csv), [data dictionary](https://github.com/justmarkham/DAT-project-examples/blob/master/pdf/nba_paper.pdf))
+* Exploring the bias-variance tradeoff ([notebook](http://nbviewer.ipython.org/github/justmarkham/DAT8/blob/master/notebooks/08_bias_variance.ipynb))
 
 **Homework:**
 * Reading assignment on the [bias-variance tradeoff](homework/09_bias_variance.md)
@@ -261,6 +257,7 @@ Tuesday | Thursday
 * [Data visualization with Seaborn](https://beta.oreilly.com/learning/data-visualization-with-seaborn) is a quick tour of some of the popular types of Seaborn plots.
 * [Visualizing Google Forms Data with Seaborn](http://pbpython.com/pandas-google-forms-part2.html) and [How to Create NBA Shot Charts in Python](http://savvastjortjoglou.com/nba-shot-sharts.html) are both good examples of Seaborn usage on real-world data.
 
+<!--
 
 -----
 
@@ -270,7 +267,7 @@ Tuesday | Thursday
 * Discuss assigned readings: [introduction](http://www.dataschool.io/reproducibility-is-not-just-for-researchers/), [Colbert Report video](http://thecolbertreport.cc.com/videos/dcyvro/austerity-s-spreadsheet-error), [cabs article](http://iquantny.tumblr.com/post/107245431809/how-software-in-half-of-nyc-cabs-generates-5-2), [Tweet](https://twitter.com/jakevdp/status/519563939177197571), [creating a reproducible analysis](https://github.com/jtleek/datasharing)
 * Examples: [Classic rock](https://github.com/fivethirtyeight/data/tree/master/classic-rock), [student project 1](https://github.com/jwknobloch/DAT4_final_project), [student project 2](https://github.com/justmarkham/DAT4-students/tree/master/Jonathan_Bryan/Project_Files)
 * Discuss the reading assignment on the [bias-variance tradeoff](homework/09_bias_variance.md)
-* Model evaluation using train/test split ([notebook](http://nbviewer.ipython.org/github/justmarkham/DAT8/blob/master/notebooks/09_model_evaluation.ipynb), notebook code)
+* Model evaluation using train/test split ([notebook](http://nbviewer.ipython.org/github/justmarkham/DAT8/blob/master/notebooks/09_model_evaluation.ipynb))
 
 **Homework:**
 * If you're brand new to linear regression, read [What is Linear Regression?](http://blog.yhathq.com/posts/what-is-linear-regression.html) and watch [The Easiest Introduction to Regression Analysis](https://www.youtube.com/watch?v=k_OB1tWX9PM) (14 minutes).

code/07_api.py

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
+'''
+CLASS: Getting Data from APIs
+
+What is an API?
+- Application Programming Interface
+- Structured way to expose specific functionality and data access to users
+- Web APIs usually follow the "REST" standard
+
+How to interact with a REST API:
+- Make a "request" to a specific URL (an "endpoint"), and get the data back in a "response"
+- Most relevant request method for us is GET (other methods: POST, PUT, DELETE)
+- Response is often JSON format
+- Web console is sometimes available (allows you to explore an API)
+'''
+
+# read IMDb data into a DataFrame: we want a year column!
+
+# use requests library to interact with a URL
+
+# check the status: 200 means success, 4xx means error
+
+# view the raw response text
+
+# decode the JSON response body into a dictionary
+
+# extracting the year from the dictionary
+
+# what happens if the movie name is not recognized?
+
+# define a function to return the year
+
+# test the function
+
+# create a smaller DataFrame for testing
+
+# write a for loop to build a list of years
+
+# check that the DataFrame and the list of years are the same length
+
+# save that list as a new column
+
+'''
+Bonus content: Updating the DataFrame as part of a loop
+'''
+
+# enumerate allows you to access the item location while iterating
+letters = ['a', 'b', 'c']
+for index, letter in enumerate(letters):
+    print index, letter
+
+# iterrows method for DataFrames is similar
+for index, row in top_movies.iterrows():
+    print index, row.title
+
+# create a new column and set a default value
+movies['year'] = -1
+
+# loc method allows you to access a DataFrame element by 'label'
+movies.loc[0, 'year'] = 1994
+
+# write a for loop to update the year for the first three movies
+for index, row in movies.iterrows():
+    if index < 3:
+        movies.loc[index, 'year'] = get_movie_year(row.title)
+        sleep(1)
+    else:
+        break
+
+'''
+Other considerations when accessing APIs:
+- Most APIs require you to have an access key (which you should store outside your code)
+- Most APIs limit the number of API calls you can make (per day, hour, minute, etc.)
+- Not all APIs are free
+- Not all APIs are well-documented
+- Pay attention to the API version
+
+Python wrapper is another option for accessing an API:
+- Set of functions that "wrap" the API code for ease of use
+- Potentially simplifies your code
+- But, wrapper could have bugs or be out-of-date or poorly documented
+'''
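
Review note: the placeholder comments in `code/07_api.py` (make a request, decode the JSON, extract the year, handle an unrecognized title, pause between calls) could be filled in roughly as follows. This is only a sketch in Python 3 using the standard library rather than `requests`, which the class file uses. The query parameters `t` and `apikey` and the response fields `Response` and `Year` are assumptions about the OMDb API, not something stated in the commit. `fetch_json` and `get_years` are hypothetical helper names; `get_movie_year` matches the name the bonus loop calls, but its body here is a guess.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the README link; the query parameters are assumptions.
OMDB_URL = 'http://www.omdbapi.com/'


def fetch_json(url):
    """Make a GET request and decode the JSON response body into a dictionary."""
    with urlopen(url) as response:
        return json.loads(response.read().decode('utf-8'))


def get_movie_year(title, api_key=None, fetch=fetch_json):
    """Return the year for a title, or None if the title is not recognized.

    Assumes the response carries 'Response' and 'Year' fields (hypothetical
    here; check the API documentation before relying on them).
    """
    params = {'t': title}
    if api_key is not None:
        params['apikey'] = api_key  # keep the real key outside your code
    data = fetch(OMDB_URL + '?' + urlencode(params))
    if data.get('Response') != 'True':
        return None
    year = str(data.get('Year', ''))[:4]
    return int(year) if year.isdigit() else None


def get_years(titles, delay=1.0, **kwargs):
    """Build a list of years, pausing between calls to respect rate limits."""
    years = []
    for title in titles:
        years.append(get_movie_year(title, **kwargs))
        time.sleep(delay)
    return years
```

Passing the fetch function as an argument is a design choice for this sketch: it lets the parsing logic be checked offline with a stand-in that returns canned dictionaries, without spending API calls.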

code/07_web_scraping.py

Lines changed: 204 additions & 0 deletions
@@ -0,0 +1,204 @@
+'''
+CLASS: Web Scraping with Beautiful Soup
+
+What is web scraping?
+- Extracting information from websites (simulates a human copying and pasting)
+- Based on finding patterns in website code (usually HTML)
+
+What are best practices for web scraping?
+- Scraping too many pages too fast can get your IP address blocked
+- Pay attention to the robots exclusion standard (robots.txt)
+- Let's look at http://www.imdb.com/robots.txt
+
+What is HTML?
+- Code interpreted by a web browser to produce ("render") a web page
+- Let's look at example.html
+- Tags are opened and closed
+- Tags have optional attributes
+
+How to view HTML code:
+- To view the entire page: "View Source" or "View Page Source" or "Show Page Source"
+- To view a specific part: "Inspect Element"
+- Safari users: Safari menu, Preferences, Advanced, Show Develop menu in menu bar
+- Let's inspect example.html
+'''
+
+# read the HTML code for a web page and save as a string
+with open('example.html', 'rU') as f:
+    html = f.read()
+
+# convert HTML into a structured Soup object
+from bs4 import BeautifulSoup
+b = BeautifulSoup(html)
+
+# print out the object
+print b
+print b.prettify()
+
+# 'find' method returns the first matching Tag (and everything inside of it)
+b.find(name='body')
+b.find(name='h1')
+
+# Tags allow you to access the 'inside text'
+b.find(name='h1').text
+
+# Tags also allow you to access their attributes
+b.find(name='h1')['id']
+
+# 'find_all' method is useful for finding all matching Tags
+b.find(name='p') # returns a Tag
+b.find_all(name='p') # returns a ResultSet (like a list of Tags)
+
+# ResultSets can be sliced like lists
+len(b.find_all(name='p'))
+b.find_all(name='p')[0]
+b.find_all(name='p')[0].text
+b.find_all(name='p')[0]['id']
+
+# iterate over a ResultSet
+results = b.find_all(name='p')
+for tag in results:
+    print tag.text
+
+# limit search by Tag attribute
+b.find(name='p', attrs={'id':'scraping'})
+b.find_all(name='p', attrs={'class':'topic'})
+
+# limit search to specific sections
+b.find_all(name='li')
+b.find(name='ul', attrs={'id':'scraping'}).find_all(name='li')
+
+'''
+EXERCISE ONE
+'''
+
+# find the 'h2' tag and then print its text
+
+# find the 'p' tag with an 'id' value of 'feedback' and then print its text
+
+# find the first 'p' tag and then print the value of the 'id' attribute
+
+# print the text of all four resources
+
+# print the text of only the API resources
+
+'''
+Scraping the IMDb website
+'''
+
+# get the HTML from the Shawshank Redemption page
+
+# convert HTML into Soup
+
+# run this code if you have encoding errors
+
+# get the title
+
+# get the star rating
+
+'''
+EXERCISE TWO
+'''
+
+# get the description
+
+# get the content rating
+
+# get the duration in minutes (as an integer)
+
+'''
+OPTIONAL WEB SCRAPING HOMEWORK
+
+First, define a function that accepts an IMDb ID and returns a dictionary of
+movie information: title, star_rating, description, content_rating, duration.
+The function should gather this information by scraping the IMDb website, not
+by calling the OMDb API. (This is really just a wrapper of the web scraping
+code we wrote above.)
+
+For example, get_movie_info('tt0111161') should return:
+
+{'content_rating': 'R',
+ 'description': u'Two imprisoned men bond over a number of years...',
+ 'duration': 142,
+ 'star_rating': 9.3,
+ 'title': u'The Shawshank Redemption'}
+
+Then, open the file imdb_ids.txt using Python, and write a for loop that builds
+a list in which each element is a dictionary of movie information.
+
+Finally, convert that list into a DataFrame.
+'''
+
+
+
+'''
+Another IMDb example: Getting the genres
+'''
+
+# read the Shawshank Redemption page again
+r = requests.get('http://www.imdb.com/title/tt0111161/')
+b = BeautifulSoup(r.text)
+
+# only gets the first genre
+b.find(name='span', attrs={'class':'itemprop', 'itemprop':'genre'})
+
+# gets all of the genres
+b.find_all(name='span', attrs={'class':'itemprop', 'itemprop':'genre'})
+
+# stores the genres in a list
+[tag.text for tag in b.find_all(name='span', attrs={'class':'itemprop', 'itemprop':'genre'})]
+
+'''
+Another IMDb example: Getting the writers
+'''
+
+# attempt to get the list of writers (too many results)
+b.find_all(name='span', attrs={'itemprop':'name'})
+
+# limit search to a smaller section to only get the writers
+b.find(name='div', attrs={'itemprop':'creator'}).find_all(name='span', attrs={'itemprop':'name'})
+
+'''
+Another IMDb example: Getting the URLs of cast images
+'''
+
+# find the images by size
+results = b.find_all(name='img', attrs={'height':'44', 'width':'32'})
+
+# check that the number of results matches the number of cast images on the page
+len(results)
+
+# iterate over the results to get all URLs
+for tag in results:
+    print tag['loadlate']
+
+'''
+Useful to know: Alternative Beautiful Soup syntax
+'''
+
+# read the example web page again
+with open('example.html', 'rU') as f:
+    html = f.read()
+
+# convert to Soup
+b = BeautifulSoup(html)
+
+# these are equivalent
+b.find(name='p') # normal way
+b.find('p') # 'name' is the first argument
+b.p # can also be accessed as an attribute of the object
+
+# these are equivalent
+b.find(name='p', attrs={'id':'scraping'}) # normal way
+b.find('p', {'id':'scraping'}) # 'name' and 'attrs' are the first two arguments
+b.find('p', id='scraping') # can write the attributes as arguments
+
+# these are equivalent
+b.find(name='p', attrs={'class':'topic'}) # normal way
+b.find('p', class_='topic') # 'class' is special, so it needs a trailing underscore
+b.find('p', 'topic') # if you don't name it, it's assumed to be the class
+
+# these are equivalent
+b.find_all(name='p') # normal way
+b.findAll(name='p') # old function name from Beautiful Soup 3
+b('p') # if you don't name the method, it's assumed to be find_all
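
Review note: the best-practices list at the top of `code/07_web_scraping.py` mentions the robots exclusion standard. One possible way to check it before scraping is the standard library's `urllib.robotparser`, sketched below in Python 3. The rules in `SAMPLE_ROBOTS` and the `example.com` URLs are invented for illustration and are not IMDb's actual robots.txt. `build_parser` and `allowed_urls` are hypothetical helper names.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only (not copied from any real site).
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt content that has already been downloaded as a string."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent='*'):
    """Keep only the URLs that the parsed rules allow this user agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_parser(SAMPLE_ROBOTS)
candidates = [
    'http://example.com/title/tt0111161/',
    'http://example.com/private/notes',
]
print(allowed_urls(parser, candidates))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` would download the file instead of parsing a string. Pausing between page requests, as the API file does with `sleep(1)`, addresses the first bullet in that list.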

data/example.html

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+<!DOCTYPE html>
+<html lang='en'>
+
+<head>
+<title>Example Web Page</title>
+</head>
+
+<body>
+
+<h1 id='main'>DAT8 Class 7</h1>
+
+<p class='topic' id='api'>First, we are covering APIs, which are useful for getting data.</p>
+<p class='topic' id='scraping'>Then, we are covering web scraping, which is a more flexible way to get data.</p>
+<p class='topic' id='feedback'>Finally, I will ask you to fill out yet another feedback form!</p>
+
+<h2>Resource List</h2>
+
+<p>Here are some helpful API resources:</p>
+
+<ul id='api'>
+<li>API resource 1</li>
+<li>API resource 2</li>
+</ul>
+
+<p>Here are some helpful web scraping resources:</p>
+
+<ul id='scraping'>
+<li>Web scraping resource 1</li>
+<li>Web scraping resource 2</li>
+</ul>
+
+</body>
+
+</html>
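
Review note: to illustrate the "finding patterns in HTML" idea against this sample page without installing anything, here is a sketch that swaps in the standard library's `html.parser` in place of Beautiful Soup (Python 3). It collects the `<li>` text inside the `<ul>` with a given `id`, which is roughly what `b.find(name='ul', attrs={'id':'scraping'}).find_all(name='li')` does in the class file. `ListItemCollector` and `list_items` are hypothetical names, and `EXAMPLE_HTML` is an excerpt of the file above.

```python
from html.parser import HTMLParser

# Excerpt of data/example.html: just the two resource lists.
EXAMPLE_HTML = """
<ul id='api'>
<li>API resource 1</li>
<li>API resource 2</li>
</ul>
<ul id='scraping'>
<li>Web scraping resource 1</li>
<li>Web scraping resource 2</li>
</ul>
"""


class ListItemCollector(HTMLParser):
    """Collect the text of <li> tags inside the <ul> with a given id."""

    def __init__(self, list_id):
        super().__init__()
        self.list_id = list_id
        self.in_list = False   # inside the target <ul>?
        self.in_item = False   # inside an <li> of the target <ul>?
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == 'ul' and dict(attrs).get('id') == self.list_id:
            self.in_list = True
        elif tag == 'li' and self.in_list:
            self.in_item = True
            self.items.append('')

    def handle_endtag(self, tag):
        if tag == 'ul':
            self.in_list = False
        elif tag == 'li':
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self.items[-1] += data


def list_items(html, list_id):
    """Return the text of each list item under the <ul> with this id."""
    collector = ListItemCollector(list_id)
    collector.feed(html)
    collector.close()
    return collector.items


print(list_items(EXAMPLE_HTML, 'scraping'))
```

The state flags make the cost of the manual approach visible: Beautiful Soup builds the tag tree for you, so restricting a search to one section is a single chained call. This sketch also does not handle nested lists.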
