Commit 83c9cba

Added a bit more code to improve understanding.
1 parent c3d7bdd commit 83c9cba

1 file changed: +20 -6 lines changed
docs/scenarios/scrape.rst

Lines changed: 20 additions & 6 deletions
@@ -36,19 +36,29 @@ parse it using the ``html`` module:
     page = urlopen('http://econpy.pythonanywhere.com/ex/001.html')
     tree = html.fromstring(page.read())
 
-`tree` now contains the whole HTML file in a nice tree structure which
-we can go over in many different ways, one of which is using XPath. XPath
-is a way of locating information in structured documents such as HTML or XML
-pages. A good introduction to XPath is `here <http://www.w3schools.com/xpath/default.asp>`_ .
+``tree`` now contains the whole HTML file in a nice tree structure which
+we can go over in two different ways: XPath and CSSSelect. In this example, I
+will focus on the former.
+
+XPath is a way of locating information in structured documents such as
+HTML or XML pages. A good introduction to XPath is `here <http://www.w3schools.com/xpath/default.asp>`_ .
+
 One can also use various tools for obtaining the XPath of elements such as
 FireBug for Firefox or in Chrome you can right click an element, choose
 'Inspect element', highlight the code and then right click again and choose
 'Copy XPath'.
 
 After a quick analysis, we see that in our page the data is contained in
 two elements - one is a div with title 'buyer-name' and the other is a
-span with class 'item-price'. Knowing this we can create the correct XPath
-query and use the lxml `xpath` function like this:
+span with class 'item-price':
+
+.. code-block:: html
+
+    <div title="buyer-name">Carson Busses</div>
+    <span class="item-price">$29.95</span>
+
+Knowing this we can create the correct XPath query and use the lxml
+``xpath`` function like this:
 
 .. code-block:: python
 
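The hunk above stops at the ``.. code-block:: python`` directive, so the query itself falls outside this diff. A minimal sketch of what such a query could look like, assuming the page uses exactly the markup quoted in the added ``html`` block (the ``SAMPLE`` string and the variable names are illustrative and not taken from the commit):

```python
from lxml import html

# Illustrative markup mirroring the two elements quoted in the diff;
# it stands in for the downloaded page so the sketch runs offline.
SAMPLE = """
<html><body>
  <div title="buyer-name">Carson Busses</div>
  <span class="item-price">$29.95</span>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# //div[@title="buyer-name"] matches every div whose title attribute is
# "buyer-name"; /text() returns the text nodes directly inside each match.
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
prices = tree.xpath('//span[@class="item-price"]/text()')

print(buyers)
print(prices)
```

Each call to ``xpath`` returns a list, so a page with many buyers yields one list entry per matching element.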
@@ -81,3 +91,7 @@ Congratulations! We have successfully scraped all the data we wanted from
 a web page using lxml and we have it stored in memory as two lists. Now we
 can either continue our work on it, analyzing it using python or we can
 export it to a file and share it with friends.
+
+A cool idea to think about is writing a script to iterate through the rest
+of the pages of this example data set or making this application use
+threads to improve its speed.
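The closing suggestion (several pages, fetched with threads) could be sketched roughly as follows. This is one possible approach and not part of the commit: the helper names ``fetch``, ``parse`` and ``scrape`` are hypothetical, and the XPath expressions assume every page uses the same ``buyer-name`` and ``item-price`` markup as the first one.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

from lxml import html


def fetch(url):
    """Download one page and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def parse(content):
    """Return (buyers, prices) extracted from one page of HTML."""
    tree = html.fromstring(content)
    buyers = tree.xpath('//div[@title="buyer-name"]/text()')
    prices = tree.xpath('//span[@class="item-price"]/text()')
    return buyers, prices


def scrape(urls, fetcher=fetch, max_workers=4):
    """Fetch pages concurrently and merge the results in URL order.

    `fetcher` is injectable (an assumption of this sketch) so the
    function can be exercised without network access.
    """
    buyers, prices = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `urls`,
        # even though the downloads themselves overlap.
        for content in pool.map(fetcher, urls):
            page_buyers, page_prices = parse(content)
            buyers.extend(page_buyers)
            prices.extend(page_prices)
    return buyers, prices
```

A call such as ``scrape(['http://econpy.pythonanywhere.com/ex/001.html'])`` would process the page used in the tutorial; the URLs of the remaining pages are not given in this diff and would need to be checked before adding them to the list. Threads suit this job because the time is dominated by waiting on the network, not by parsing.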
