Commit 8c7008d

Initial migration from scrapylib codebase
Including Python 3 porting by @nyov from scrapinghub/scrapylib#67

1 parent: aa83d0a

File tree

14 files changed: +307 lines, -2 lines


.bumpversion.cfg

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+[bumpversion]
+current_version = 0.1.0
+commit = True
+tag = True
+
+[bumpversion:file:setup.py]
+
+[bumpversion:file:scrapy_querycleaner/__init__.py]
+

.coveragerc

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+[run]
+branch = true
+source = scrapy_querycleaner

.travis.yml

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+language: python
+python: 3.5
+
+sudo: false
+
+env:
+  matrix:
+  - TOXENV=py27
+  - TOXENV=py35
+
+install: pip install -U tox codecov
+
+script: tox
+
+after_success:
+- codecov

CHANGES.rst

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+Changes
+=======
+
+
+x.x.x (yyyy-mm-dd)
+------------------
+
+Initial release.
+
+This version is functionally equivalent to scrapylib's v1.7.0
+``scrapylib.querycleaner.QueryCleanerMiddleware``.
+

README.md

Lines changed: 0 additions & 2 deletions
This file was deleted.

README.rst

Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
+===================
+scrapy-querycleaner
+===================
+
+.. image:: https://travis-ci.org/scrapy-plugins/scrapy-querycleaner.svg?branch=master
+   :target: https://travis-ci.org/scrapy-plugins/scrapy-querycleaner
+
+.. image:: https://codecov.io/gh/scrapy-plugins/scrapy-querycleaner/branch/master/graph/badge.svg
+   :target: https://codecov.io/gh/scrapy-plugins/scrapy-querycleaner
+
+This is a Scrapy spider middleware that cleans up the GET query parameters
+of request URLs at the output of the spider, according to patterns provided
+by the user.
+
+
+Installation
+============
+
+Install scrapy-querycleaner using ``pip``::
+
+    $ pip install scrapy-querycleaner
+
+
+Configuration
+=============
+
+1. Add ``QueryCleanerMiddleware`` by including it in ``SPIDER_MIDDLEWARES``
+   in your ``settings.py`` file::
+
+      SPIDER_MIDDLEWARES = {
+          'scrapy_querycleaner.QueryCleanerMiddleware': 100,
+      }
+
+   Here, priority ``100`` is just an example.
+   Set its value depending on other middlewares you may have enabled already.
+
+2. Enable the middleware by setting either ``QUERYCLEANER_REMOVE``
+   or ``QUERYCLEANER_KEEP`` (or both) in your ``settings.py``.
+
+
+Usage
+=====

+At least one of the following settings must be present for the
+middleware to be enabled.
+
+.. note::
+   You can specify a list of parameter names by using the ``|`` (*OR*) regex
+   operator.
+
+   For example, the pattern ``search|login|postid`` will match the query
+   parameters *search*, *login* and *postid*.
+   This is by far the most common use case.
+
+   Setting ``QUERYCLEANER_REMOVE`` to ``.*`` removes all URL query
+   parameters.
+
+
+Supported settings
+------------------
+
+``QUERYCLEANER_REMOVE``
+    a pattern (regular expression) that a query parameter name must match
+    in order to be removed from the URL. (All others are kept.)
+
+``QUERYCLEANER_KEEP``
+    a pattern that a query parameter name must match in order to be kept
+    in the URL. (All others are removed.)
+
+You can combine both settings if some query parameter patterns should be
+kept and others removed.
+
+The **remove** pattern takes precedence over the *keep* one.
+
+
+Example
+-------
+
+Let's suppose that the spider extracts URLs like::
+
+    http://www.example.com/product.php?pid=135&cid=12&ttda=12
+
+and we want to keep only the parameter ``pid``.
+
+To achieve this we can use either ``QUERYCLEANER_REMOVE``
+or ``QUERYCLEANER_KEEP``:
+
+- In the first case, the pattern would be ``cid|ttda``::
+
+      QUERYCLEANER_REMOVE = 'cid|ttda'
+
+- In the second case, ``pid``::
+
+      QUERYCLEANER_KEEP = 'pid'
+
+The best choice depends on the particular case, that is, on how the query
+filters will affect any other URL that the spider is expected to extract.
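The remove/keep semantics described above can be sketched in plain Python. This is an illustrative re-implementation under stated assumptions, not the middleware's actual code: `clean_query` is a hypothetical helper, and it uses the standard library's `urllib.parse` instead of the project's `six`/`w3lib` helpers.

```python
import re
from urllib.parse import urlsplit, urlunsplit

def clean_query(url, remove=None, keep=None):
    """Hypothetical sketch of the documented semantics: drop parameters
    matching `remove`, keep only those matching `keep`; remove wins."""
    remove_re = re.compile(remove) if remove else None
    keep_re = re.compile(keep) if keep else None
    parts = urlsplit(url)
    kept = []
    for pair in parts.query.split("&"):
        name = pair.split("=", 1)[0]
        if remove_re and remove_re.search(name):
            continue  # remove pattern has precedence
        if keep_re is None or keep_re.search(name):
            kept.append(pair)
    return urlunsplit(parts._replace(query="&".join(kept)))

url = "http://www.example.com/product.php?pid=135&cid=12&ttda=12"
print(clean_query(url, remove="cid|ttda"))  # only pid survives
print(clean_query(url, keep="pid"))         # same result
```

Both calls reduce the example URL to ``http://www.example.com/product.php?pid=135``, which is why either setting works for this particular URL.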

requirements.txt

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+scrapy>=1.0
+six

scrapy_querycleaner/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+from .middleware import QueryCleanerMiddleware
+
+
+__version__ = "0.1.0"

scrapy_querycleaner/middleware.py

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
+"""GET query parameter cleaner for AS.
+
+Set the remove/keep patterns (regexes) with
+
+QUERYCLEANER_REMOVE
+QUERYCLEANER_KEEP
+
+The remove pattern has precedence.
+"""
+import re
+from six.moves.urllib.parse import quote
+from six import string_types
+
+from scrapy.utils.httpobj import urlparse_cached
+from scrapy.http import Request
+from scrapy.exceptions import NotConfigured
+
+from w3lib.url import _safe_chars
+
+def _parse_query_string(query):
+    """Replacement for cgi.parse_qsl.
+
+    The cgi version returns the same pair for the queries 'key' and
+    'key=', so both reconstruct to the same string, but some sites do
+    not handle the two forms in the same way.  This version returns
+    (key, None) in the first case and (key, '') in the second, so a
+    correct reconstruction can be performed."""
+    params = query.split("&")
+    keyvals = []
+    for param in params:
+        kv = param.split("=") + [None]
+        keyvals.append((kv[0], kv[1]))
+    return keyvals
+
+def _filter_query(query, remove_re=None, keep_re=None):
+    """
+    Filters the parameters of a query string according to key patterns.
+    >>> _filter_query('as=3&bs=8&cs=9')
+    'as=3&bs=8&cs=9'
+    >>> _filter_query('as=3&bs=8&cs=9', None, re.compile("as|bs"))
+    'as=3&bs=8'
+    >>> _filter_query('as=3&bs=8&cs=9', re.compile("as|bs"))
+    'cs=9'
+    >>> _filter_query('as=3&bs=8&cs=9', re.compile("as|bs"), re.compile("as|cs"))
+    'cs=9'
+    """
+    keyvals = _parse_query_string(query)
+    qargs = []
+    for k, v in keyvals:
+        if remove_re is not None and remove_re.search(k):
+            continue
+        if keep_re is None or keep_re.search(k):
+            qarg = quote(k, _safe_chars)
+            if isinstance(v, string_types):
+                qarg = qarg + '=' + quote(v, _safe_chars)
+            qargs.append(qarg.replace("%20", "+"))
+    return '&'.join(qargs)
+
+class QueryCleanerMiddleware(object):
+    def __init__(self, settings):
+        remove = settings.get("QUERYCLEANER_REMOVE")
+        keep = settings.get("QUERYCLEANER_KEEP")
+        if not (remove or keep):
+            raise NotConfigured
+        self.remove = re.compile(remove) if remove else None
+        self.keep = re.compile(keep) if keep else None
+
+    @classmethod
+    def from_crawler(cls, crawler):
+        return cls(crawler.settings)
+
+    def process_spider_output(self, response, result, spider):
+        for res in result:
+            if isinstance(res, Request):
+                parsed = urlparse_cached(res)
+                if parsed.query:
+                    parsed = parsed._replace(
+                        query=_filter_query(parsed.query, self.remove, self.keep))
+                    res = res.replace(url=parsed.geturl())
+            yield res
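The subtle part of this file is the docstring of `_parse_query_string`: the standard library collapses a bare `key` and an explicit `key=` into the same pair, while the middleware keeps them distinct so URLs can be rebuilt exactly. A minimal sketch of that distinction, assuming plain `urllib.parse` instead of the `six` wrappers (`parse_query` is a hypothetical stand-in for the private helper):

```python
from urllib.parse import parse_qsl

def parse_query(query):
    """Split a query string, preserving the difference between a bare
    'key' (value None) and an explicit 'key=' (value '')."""
    pairs = []
    for param in query.split("&"):
        kv = param.split("=", 1) + [None]  # pad so lone keys get value None
        pairs.append((kv[0], kv[1]))
    return pairs

query = "a&b=&c=1"
print(parse_query(query))  # [('a', None), ('b', ''), ('c', '1')]
# The stdlib cannot tell 'a' and 'b=' apart:
print(parse_qsl(query, keep_blank_values=True))  # [('a', ''), ('b', ''), ('c', '1')]
```

Because `None` and `''` are distinguished, the rebuild step in `_filter_query` only appends `=` when the value is a string, reproducing the site's original URL form.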

setup.cfg

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+[bdist_wheel]
+universal=1
