@@ -5,10 +5,10 @@ Web Scraping
 ------------
 
 Web sites are written using HTML, which means that each web page is a
-structured document. Sometimes it would be great to obtain some data from
-them and preserve the structure while we're at it, but this isn't always easy
-- it's not often that web sites provide their data in comfortable formats
-such as `.csv`.
+structured document. Sometimes it would be great to obtain some data from
+them and preserve the structure while we're at it, but this isn't always easy.
+It's not often that web sites provide their data in comfortable formats
+such as ``.csv``.
 
 This is where web scraping comes in. Web scraping is the practice of using
 computer program to sift through a web page and gather the data that you need
@@ -18,8 +18,8 @@
 ----
 
 `lxml <http://lxml.de/>`_ is a pretty extensive library written for parsing
-XML and HTML documents, which you can easily install using `pip`. We will
-be using its `html` module to get data from this web page: `econpy <http://econpy.pythonanywhere.com/ex/001.html>'_ .
+XML and HTML documents, which you can easily install using ``pip``. We will
+be using its `html` module to get data from this web page: `econpy <http://econpy.pythonanywhere.com/ex/001.html>`_ .
 
 First we shall import the required modules:
 
@@ -28,8 +28,8 @@ First we shall import the required modules:
     from lxml import html
     from urllib2 import urlopen
 
-We will use `urllib2.urlopen` to retrieve the web page with our data and
-parse it using the `html` module:
+We will use ``urllib2.urlopen`` to retrieve the web page with our data and
+parse it using the ``html`` module:
 
 .. code-block:: python
 
@@ -39,7 +39,7 @@ parse it using the `html` module:
 `tree` now contains the whole HTML file in a nice tree structure which
 we can go over in many different ways, one of which is using XPath. XPath
 is a way of locating information in structured documents such as HTML or XML
-pages. A good introduction to XPath is 'here <http://www.w3schools.com/xpath/default.asp>'_ .
+pages. A good introduction to XPath is `here <http://www.w3schools.com/xpath/default.asp>`_ .
 One can also use various tools for obtaining the XPath of elements such as
 FireBug for Firefox or in Chrome you can right click an element, choose
 'Inspect element', highlight the code and the right click again and choose
@@ -65,6 +65,7 @@ Lets see what we got exactly:
     print 'Prices: ', prices
 
 ::
+
     Buyers: ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes',
     'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff',
     'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup',
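For readers of this change who want to try the XPath step that the guide text describes, here is a minimal sketch. It rests on several assumptions: it is written for Python 3 (the guide itself uses Python 2), and it swaps in the standard library's ``xml.etree.ElementTree`` for lxml so that it runs without extra installs. The sample markup, the ``buyer-name`` and ``item-price`` attribute values, the prices, and the ``extract`` helper are all hypothetical and are not taken from the econpy page. Only the two buyer names appear in the output shown in the diff.

```python
# Python 3 sketch; the standard library's ElementTree stands in for lxml here.
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (an assumption, not the real page).
SAMPLE_PAGE = """
<html>
  <body>
    <div title="buyer-name">Carson Busses</div>
    <span class="item-price">$29.95</span>
    <div title="buyer-name">Earl E. Byrd</div>
    <span class="item-price">$8.37</span>
  </body>
</html>
"""


def extract(page):
    """Return (buyers, prices) as lists of strings taken from the markup."""
    # Parse the markup into an element tree.
    tree = ET.fromstring(page.strip())
    # ElementTree supports a limited XPath subset, including attribute tests.
    buyers = [el.text for el in tree.findall(".//div[@title='buyer-name']")]
    prices = [el.text for el in tree.findall(".//span[@class='item-price']")]
    return buyers, prices


buyers, prices = extract(SAMPLE_PAGE)
print('Buyers:', buyers)
print('Prices:', prices)
```

With lxml installed, the same lookup could probably be written as ``html.fromstring(page).xpath('//div[@title="buyer-name"]/text()')``. Unlike ElementTree, lxml's HTML parser also tolerates markup that is not well-formed XML.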