Commit 32dea94

Using requests instead of urllib2, final draft.
1 parent a22a6e9 commit 32dea94

1 file changed: 12 additions, 10 deletions

docs/scenarios/scrape.rst

Lines changed: 12 additions & 10 deletions
@@ -14,27 +14,29 @@ This is where web scraping comes in. Web scraping is the practice of using
 computer program to sift through a web page and gather the data that you need
 in a format most useful to you.
 
-lxml
-----
+lxml and Requests
+-----------------
 
 `lxml <http://lxml.de/>`_ is a pretty extensive library written for parsing
-XML and HTML documents, which you can easily install using ``pip``. We will
-be using its ``html`` module to get example data from this web page: `econpy.org <http://econpy.pythonanywhere.com/ex/001.html>`_ .
+XML and HTML documents really fast. It even handles messed up tags. We will
+also be using the `Requests <http://docs.python-requests.org/en/latest/>`_ module instead of the already built-in urlib2
+due to improvements in speed and readability. You can easily install both
+using ``pip install lxml`` and ``pip install requests``.
 
-First we shall import the required modules:
+Lets start with the imports:
 
 .. code-block:: python
 
     from lxml import html
-    from urllib2 import urlopen
+    import requests
 
-We will use ``urllib2.urlopen`` to retrieve the web page with our data and
-parse it using the ``html`` module:
+Next we will use ``requests.get`` to retrieve the web page with our data
+and parse it using the ``html`` module and save the results in ``tree``:
 
 .. code-block:: python
 
-    page = urlopen('http://econpy.pythonanywhere.com/ex/001.html')
-    tree = html.fromstring(page.read())
+    page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
+    tree = html.fromstring(page.text)
 
 ``tree`` now contains the whole HTML file in a nice tree structure which
 we can go over two different ways: XPath and CSSSelect. In this example, I
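
For reference, the fetch-and-parse step this change introduces could be sketched roughly as below. This is not part of the commit. The ``fetch_tree`` helper, the ``timeout`` value, the ``raise_for_status()`` check, and the inline sample markup are all illustrative assumptions; only ``requests.get``, ``page.text``, and ``html.fromstring`` come from the diff itself. The demonstration parses a small hard-coded string so that it runs without network access.

```python
from lxml import html
import requests


def fetch_tree(url, timeout=10):
    """Download a page and parse it into an lxml element tree.

    Hypothetical helper: it wraps the two calls from the diff and adds
    a timeout and a status check, neither of which is in the commit.
    """
    page = requests.get(url, timeout=timeout)
    page.raise_for_status()  # raise on 4xx/5xx rather than parse an error page
    return html.fromstring(page.text)


# Offline demonstration with made-up markup (an assumption, not the real page).
sample_html = (
    "<html><body><h1>Fruit</h1>"
    "<ul><li>Apple</li><li>Pear</li></ul>"
    "</body></html>"
)
tree = html.fromstring(sample_html)

# XPath query: collect the text of every <li> element in the tree.
items = tree.xpath("//li/text()")
print(items)
```

In real use, ``fetch_tree('http://econpy.pythonanywhere.com/ex/001.html')`` would stand in for the hard-coded string; the XPath expression would then need to match that page's actual markup.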
