Commit f9b59dd

bpo-46337: document SchemeClass behavior
This functionality was exposed in 53c6ccc.
1 parent eee880c commit f9b59dd

1 file changed

Lines changed: 74 additions & 15 deletions

Doc/library/urllib.parse.rst

@@ -25,7 +25,8 @@ Resource Locators. It supports the following URL schemes: ``file``, ``ftp``,
2525
``gopher``, ``hdl``, ``http``, ``https``, ``imap``, ``mailto``, ``mms``,
2626
``news``, ``nntp``, ``prospero``, ``rsync``, ``rtsp``, ``rtspu``, ``sftp``,
2727
``shttp``, ``sip``, ``sips``, ``snews``, ``svn``, ``svn+ssh``, ``telnet``,
28-
``wais``, ``ws``, ``wss``.
28+
``wais``, ``ws``, ``wss``. The behavior of other schemes may be controlled with
29+
a collection of ``SchemeClass`` enums passed to dependent functions.
2930

3031
The :mod:`urllib.parse` module defines functions that fall into two broad
3132
categories: URL parsing and URL quoting. These are covered in detail in
@@ -37,24 +38,33 @@ URL Parsing
3738
The URL parsing functions focus on splitting a URL string into its components,
3839
or on combining URL components into a URL string.
3940

40-
.. function:: urlparse(urlstring, scheme='', allow_fragments=True)
41+
.. function:: urlparse(urlstring, scheme='', allow_fragments=True, classes=set())
4142

42-
Parse a URL into six components, returning a 6-item :term:`named tuple`. This
43-
corresponds to the general structure of a URL:
44-
``scheme://netloc/path;parameters?query#fragment``.
45-
Each tuple item is a string, possibly empty. The components are not broken up
46-
into smaller parts (for example, the network location is a single string), and %
43+
Parse a URL into six components with respect to given scheme classes,
44+
returning a 6-item :term:`named tuple`. This corresponds to the general
45+
structure of a URL: ``scheme://netloc/path;parameters?query#fragment``. Each
46+
tuple item is a string, possibly empty. The components are not broken up into
47+
smaller parts (for example, the network location is a single string), and %
4748
escapes are not expanded. The delimiters as shown above are not part of the
48-
result, except for a leading slash in the *path* component, which is retained if
49-
present. For example:
49+
result, except for a leading slash in the *path* component, which is retained
50+
if present.
51+
52+
The scheme of the URL determines whether or not parameters are parsed as
53+
distinct from the path. To override the scheme and parse parameters anyway,
54+
pass a set containing ``SchemeClass.PARAMS``.
55+
56+
For example:
5057

5158
.. doctest::
5259
:options: +NORMALIZE_WHITESPACE
5360

54-
>>> from urllib.parse import urlparse
61+
>>> from urllib.parse import urlparse, SchemeClass
5562
>>> urlparse("scheme://netloc/path;parameters?query#fragment")
5663
ParseResult(scheme='scheme', netloc='netloc', path='/path;parameters', params='',
5764
query='query', fragment='fragment')
65+
>>> urlparse("scheme://netloc/path;parameters?query#fragment", classes=[SchemeClass.PARAMS])
66+
ParseResult(scheme='scheme', netloc='netloc', path='/path',
67+
params='parameters', query='query', fragment='fragment')
5868
>>> o = urlparse("http://docs.python.org:80/3/library/urllib.parse.html?"
5969
... "highlight=params#url-parsing")
6070
>>> o
@@ -348,19 +358,21 @@ or on combining URL components into a URL string.
348358
with an empty query; the RFC states that these are equivalent).
349359

350360

351-
.. function:: urljoin(base, url, allow_fragments=True)
361+
.. function:: urljoin(base, url, allow_fragments=True, classes=set())
352362

353363
Construct a full ("absolute") URL by combining a "base URL" (*base*) with
354-
another URL (*url*). Informally, this uses components of the base URL, in
355-
particular the addressing scheme, the network location and (part of) the
356-
path, to provide missing components in the relative URL. For example:
364+
another URL (*url*), and with behavior given by a set of ``SchemeClass``
365+
enums. Informally, this uses components of the base URL, in particular the
366+
addressing scheme, the network location and (part of) the path, to provide
367+
missing components in the relative URL. For example:
357368

358369
>>> from urllib.parse import urljoin
359370
>>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
360371
'http://www.cwi.nl/%7Eguido/FAQ.html'
361372

362373
The *allow_fragments* argument has the same meaning and default as for
363-
:func:`urlparse`.
374+
:func:`urlparse`. As in :func:`urlparse`, a ``SchemeClass`` set may be given
375+
to override the behavior inferred from the scheme.
364376

365377
.. note::
366378

@@ -543,6 +555,53 @@ operating on :class:`bytes` or :class:`bytearray` objects:
543555

544556
.. versionadded:: 3.2
545557

558+
Special URL Behaviors and Scheme Classes
559+
----------------------------------------
560+
561+
:mod:`urllib.parse` recognizes three special properties of URL schemes: relative
562+
addressing (used by, for instance, the ``ftp``, ``http``, and ``gopher``
563+
schemes), netloc-sensitive resolution (used by, for instance, ``ftp``, ``http``,
564+
and ``git``), and support for path parameters (for instance, ``ftp`` and
565+
``tel``).
566+
567+
Relative addressing allows resolution of relative URLs, and netloc-sensitive
568+
addressing allows resolution with respect to the netloc (the network location,
569+
such as the host name and port) of a URL. HTTP URLs have both behaviors by
570+
default, as the following example shows:
571+
572+
>>> from urllib.parse import urljoin
573+
>>> urljoin('http://example.org/post/x', '../y')
574+
'http://example.org/y'
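As a rough sketch of how different kinds of relative references resolve against an HTTP base, the following uses only the existing :func:`urljoin` (no ``classes`` argument). The variable names are illustrative only.

```python
from urllib.parse import urljoin

base = "http://example.org/post/x"

# A bare name replaces the last path segment of the base.
sibling = urljoin(base, "y")
# The base's last segment ("x") is dropped first, so ".." then removes "post".
parent = urljoin(base, "../y")
# A leading slash replaces the whole path but keeps the netloc.
rooted = urljoin(base, "/y")
# A leading "//" replaces the netloc as well and keeps only the scheme.
other_host = urljoin(base, "//example.com/y")

print(sibling)     # http://example.org/post/y
print(parent)      # http://example.org/y
print(rooted)      # http://example.org/y
print(other_host)  # http://example.com/y
```

The last case is where netloc-sensitive resolution matters: the reference supplies its own network location, and only the scheme is inherited from the base.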
575+
576+
Additionally, unless a URL's scheme is marked as accepting parameters (the text
577+
after a semicolon in the path), the parameters are treated as part of the path
578+
rather than as a distinct component.
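The difference can be seen with the existing :func:`urlparse`, without any ``classes`` argument. This is a small sketch; the ``my-protocol`` scheme and the URLs are made up for illustration.

```python
from urllib.parse import urlparse

# "http" is a scheme that accepts parameters, so ";type=a" is split off.
known = urlparse("http://example.org/path;type=a?q=1")

# "my-protocol" is not a known scheme, so the semicolon stays in the path.
unknown = urlparse("my-protocol://example.org/path;type=a?q=1")

print(repr(known.path), repr(known.params))      # '/path' 'type=a'
print(repr(unknown.path), repr(unknown.params))  # '/path;type=a' ''
```

In both cases the query is parsed the same way; only the handling of the semicolon differs.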
579+
580+
When no ``classes`` argument is given and the global lists are left unmodified,
581+
the module decides which of these behaviors to apply based on the URL's scheme.
582+
The schemes associated with each behavior are given by three lists in :mod:`urllib.parse`:
583+
584+
* ``urllib.parse.uses_relative``
585+
* ``urllib.parse.uses_netloc``
586+
* ``urllib.parse.uses_params``
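To see which behaviors a scheme gets by default, these lists can be inspected directly. The helper below, ``default_behaviors``, is hypothetical and not part of :mod:`urllib.parse`; it is only a sketch that reports list membership.

```python
import urllib.parse as up


def default_behaviors(scheme):
    """Return the names of the uses_* lists that contain *scheme*.

    Hypothetical helper for illustration; not part of urllib.parse.
    """
    lists = {
        "relative": up.uses_relative,
        "netloc": up.uses_netloc,
        "params": up.uses_params,
    }
    return {name for name, schemes in lists.items() if scheme in schemes}


print(sorted(default_behaviors("http")))         # ['netloc', 'params', 'relative']
print(sorted(default_behaviors("git")))          # ['netloc']
print(sorted(default_behaviors("my-protocol")))  # []
```

A scheme that appears in none of the lists, such as the made-up ``my-protocol``, gets none of the three behaviors by default.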
587+
588+
In addition, any function that takes a ``classes`` parameter (for instance,
589+
:func:`urlparse` and :func:`urljoin`) can override the ``uses_*`` lists. For
590+
example, a custom or rarely used scheme can be resolved with the same relative
591+
and netloc behavior as HTTP:
592+
593+
>>> from urllib.parse import urljoin, SchemeClass
594+
>>> urljoin(
595+
...     'my-protocol://example.org/post/x', '../y',
596+
...     classes=[SchemeClass.NETLOC, SchemeClass.RELATIVE])
597+
'my-protocol://example.org/y'
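For contrast, without a ``classes`` parameter the same effect requires editing the global ``uses_*`` lists. The sketch below assumes that temporarily mutating those lists is acceptable in your program; the change is process-wide while it is in effect, and ``my-protocol`` is a made-up scheme.

```python
import urllib.parse as up

base = "my-protocol://example.org/post/x"

# Unknown scheme: urljoin gives up and returns the relative URL unchanged.
before = up.urljoin(base, "../y")

# Temporarily register the scheme in the global lists (affects all callers).
touched = (up.uses_relative, up.uses_netloc)
for schemes in touched:
    schemes.append("my-protocol")
try:
    after = up.urljoin(base, "../y")
finally:
    # Undo the registration so other code sees the original lists.
    for schemes in touched:
        schemes.remove("my-protocol")

print(before)  # ../y
print(after)   # my-protocol://example.org/y
```

Passing the behavior per call avoids this kind of shared global state.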
598+
599+
For reference, the following three scheme classes are present (corresponding
600+
exactly to the ``uses_*`` lists):
601+
602+
* ``urllib.parse.SchemeClass.RELATIVE``
603+
* ``urllib.parse.SchemeClass.NETLOC``
604+
* ``urllib.parse.SchemeClass.PARAMS``
546605

547606
URL Quoting
548607
-----------
