
hyperlink's Introduction

Hyperlink

Cool URLs that don't change.


Hyperlink provides a pure-Python implementation of immutable URLs. Based on RFC 3986 and 3987, the Hyperlink URL makes working with both URIs and IRIs easy.

Hyperlink is tested against Python 2.7, 3.4, 3.5, 3.6, 3.7, 3.8, and PyPy.

Full documentation is available on Read the Docs.

Installation

Hyperlink is a pure-Python package and requires nothing but Python. The easiest way to install is with pip:

pip install hyperlink

Then, hyperlink away!

from hyperlink import URL

url = URL.from_text(u'http://github.com/python-hyper/hyperlink?utm_source=README')
utm_source = url.get(u'utm_source')
better_url = url.replace(scheme=u'https', port=443)
org_url = better_url.click(u'.')

See the full API docs on Read the Docs.

More information

Hyperlink would not have been possible without the help of Glyph Lefkowitz and many other community members, especially considering that it started as an extract from the Twisted networking library. Thanks to them, Hyperlink's URL has been production-grade for well over a decade.

Still, should you encounter any issues, do file an issue, or submit a pull request.

hyperlink's People

Contributors

adamchainz, adiroiban, alexwlchan, euresti, exarkun, funkyhat, ggranjus, glyph, hawkowl, hugovk, julian, kriechi, lukasa, mahmoud, markrwilliams, mineo, razerm, requires, twm, wbolster, wsanchez


hyperlink's Issues

Document DecodedURL

There don't seem to be API docs for DecodedURL. As far as I can see it's only mentioned in the docstring of hyperlink.parse (which also mentions EncodedURL without explaining that URL is EncodedURL).

URL's escaping behavior is inconsistent between path manipulation functions and querystring manipulators. The former escape, while the latter validate input:

Python 3.7.7 (default, Mar 10 2020, 15:16:38) 
>>> from hyperlink import URL, DecodedURL
>>> u = URL.from_text('https://example.com')
>>> u.child('foo/bar')
URL.from_text('https://example.com/foo%2Fbar')
>>> u.add('foo', '&')
Traceback (most recent call last):
  ...
ValueError: one or more reserved delimiters &# present in query parameter value: '&'

The documentation of add(), set(), etc. should at least mention this validation.

DecodedURL has consistent behavior: both the path and querystring manipulators escape as required:

>>> du = DecodedURL.from_text('https://example.com')
>>> du.child('foo/bar')
DecodedURL(url=URL.from_text('https://example.com/foo%2Fbar'))
>>> du.add('foo', '&')
DecodedURL(url=URL.from_text('https://example.com/?foo=%26'))

I find this behavior less surprising.

Recommended practice for adding reserved characters?

Per @markrwilliams' comment here and a few others dotted around, we're facing a design gap in hyperlink's APIs.

To paraphrase:

url = URL()
url.add(u'param', u'#value').to_text()

Yields:

ValueError: one or more reserved delimiters &# present in query parameter value: u'#value'

This is due to a subtle shift in hyperlink's design compared to twisted.python.url. t.p.url would allow any string value in, whereas hyperlink prefers to store the "minimally-encoded" version. This is why a ValueError is raised from the code above.

Technically, this can be solved by making the code url.add(u'param', _encode_query_part(u'#value')). But hyperlink's primary goal is to handle encoding/decoding, does it really make sense to push that back on the user?

One solution Mark and I discussed would be to switch to decoding every value passed in. But what if someone were to pass in u'%23%' and actually intend for that to be their decoded value? And the API would be further complicated by the fact that the underlying decoding is generally unknown. UTF8, Latin-1, and plain old binary are all valid in percent-encoded URL parts. Autodecoding UTF8 might have better usability most of the time, but much like relying on Python 2's implicit encoding/decoding, the safety of the explicit _encode_*_part() is probably preferable.

It might occur to one that this entire problem bears some resemblance to the bytes/unicode split, as URL has URL.to_uri() and URL.to_iri(). There is some truth to this, but IRIs and URIs are both URLs. Having two types imposes a sort of artificial split I'd like to avoid if possible, but we also don't have a good way to represent an already decoded IRI. This was causing an issue with double decoding on multiple .to_iri() calls (see #16).

Right now my best idea is to enable that technical solution above by exposing the various encoding and decoding functions as public APIs, since those may prove useful utilities for other contexts anyways. I'm sure there are better ideas, too, so I'm going to leave this issue open as a place for discussion on handling this quandary.
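The stopgap could be approximated today with the standard library. A sketch using urllib.parse.quote on Python 3 — a stand-in I'm naming encode_query_value, not hyperlink's actual private _encode_query_part():

```python
from urllib.parse import quote

def encode_query_value(value):
    # Percent-encode the delimiters hyperlink's validation rejects
    # ('&' and '#'), while leaving other query-legal characters alone.
    return quote(value, safe="/?:@!$'()*+,;=~-._")
```

With this, url.add(u'param', encode_query_value(u'#value')) passes validation, at the cost of pushing the encoding step onto the caller — exactly the trade-off discussed above.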

Usage Question: trailing slashes

Thanks for a great library!

Is there a recommended way to ensure a URL path ends with a slash? Or—conversely—to ensure a URL path does not end with slash?

Some sites will attempt to redirect you if you're missing a slash or if you've included one at the end of the path section of the URL. In a situation where you're accepting the path as user input, it becomes useful to anticipate this and avoid human error.

While Python string manipulation allows me to achieve my goal, I was wondering if the library provides a way of handling this.
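For what it's worth, hyperlink represents a trailing slash as an empty final path segment (URL.from_text(u'http://x/a/').path == ('a', '')). A minimal sketch operating on that tuple — the helper names are mine, not library API; usage would be url.replace(path=ensure_trailing_slash(url.path)):

```python
def ensure_trailing_slash(segments):
    # An empty final segment renders as a trailing slash.
    segments = tuple(segments)
    if segments and segments[-1] == u'':
        return segments
    return segments + (u'',)

def strip_trailing_slash(segments):
    # Drop the empty final segment, if present.
    segments = tuple(segments)
    if segments and segments[-1] == u'':
        return segments[:-1]
    return segments
```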

Thanks!

Contrast with yarl in FAQ?

I noticed that the FAQ entry doesn't mention yarl.

On the surface, both libs seem quite similar, both do the immutable thing, and are IRI-capable. After a (very cursory) review, the only differences I've been able to see are minor API flavor things, e.g. hyperlink.URL(**components) vs yarl.URL.build(**components).

`asText` needs to do some re-escaping of path segments

>>> a = hyperlink.URL.fromText("urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob")
>>> b = hyperlink.URL.fromText(a.asText())
>>> a == b
True
>>> c = hyperlink.URL.fromText(a.asIRI().asText())
>>> c == a
False
>>> c == b
False

If there are colons (maybe other characters too) in the path segment of a relative URI, they need to be escaped on output; .asText() should never produce a syntactically invalid URI.
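The needed output-time escaping could look something like this sketch (my helper, covering only the colon-in-first-segment case; the real fix would belong inside .asText()):

```python
def escape_segment_colons(segment, is_first, rooted, has_scheme):
    # In a relative reference, a ':' in the first path segment would be
    # misread as a scheme delimiter on re-parse, so percent-encode it.
    if is_first and not rooted and not has_scheme:
        return segment.replace(u':', u'%3A')
    return segment
```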

Raise if host field contains '/'

These two URLs render the same but are not the same:

URL(scheme=u'http', userinfo=u'', host=u'a', port=80, path=(u'c',), query=(), fragment=u'', rooted=True)
URL(scheme=u'http', userinfo=u'', host=u'a/c', port=80, path=(), query=(), fragment=u'', rooted=True)

Which is baffling until you look at all of the fields separately.

Two URLs that render as the same text but are not equal

I can create two URLs that render as the same text but are not equal:

>>> from hyperlink import URL
>>> a = URL(path=(u"?",))
>>> b = URL(path=(u"%3F",))
>>> a.asText()
u'%3F'
>>> b.asText()
u'%3F'
>>> a.asText() == b.asText()
True
>>> a == b
False

remove "uses_netloc" parameter and attribute to .__init__ and .replace

In twisted.python.url I tried very hard to eliminate the use of the confusing, vague, and antiquated term "netloc". (NB: neither https://tools.ietf.org/html/rfc3986 nor https://url.spec.whatwg.org includes the string "netloc"). I would therefore like to eliminate its use in uses_netloc as well.

uses_netloc describes an attribute of a scheme. However, in its current incarnation it can be made nonsensically inconsistent with the scheme it's actually using. Additionally, schemes also have other attributes, such as a name, and a default port number. (Possibly more in the future?) As such, the presence of the scheme registry makes URL objects kinda/sorta mutable after the fact.

My suggestion for a replacement would be for URL to reference a structured scheme object; the external scheme registry could then be a collection of such objects. If a URL is created before the registry is populated, its scheme object could then be inconsistent with the global one, but it could be replaced. At this point, I believe schemes would just have the three attributes given above (i.e., the parameters to registerScheme).

rooted flag exposes serialization ambiguity

This should not be possible:

>>> import hyperlink
>>> a = hyperlink.URL(path=['', 'foo'], rooted=False)
>>> b = hyperlink.URL(path=['foo'], rooted=True)
>>> a == b
False
>>> a
URL.from_text('/foo')
>>> b
URL.from_text('/foo')
>>> a.normalize() == b.normalize()
False

I think that if path ever starts with a '' when rooted=False, we ought to flip the flag and remove the path segment.
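That fold-in rule can be sketched directly (a hypothetical helper, not library code):

```python
def fold_rooted(path, rooted):
    # An un-rooted path starting with an empty segment would serialize
    # with a leading slash; flip the flag and drop the segment instead,
    # so equal serializations imply equal values.
    path = tuple(path)
    if not rooted and path and path[0] == u'':
        return path[1:], True
    return path, rooted
```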

normalizing "free radical" percent signs

I think that

>>> hyperlink.URL(path=['%%%']).normalize()
URL.from_text('%%%')

ought to be giving me %25%25%25 ?

I haven't managed to dredge up a spec reference for this, but a % in the path without 2 hex digits after it seems like it ought to just be quoted.
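The quoting rule is easy to state with a regular expression — a sketch of the proposed normalization, not hyperlink's internals:

```python
import re

# A '%' not followed by two hex digits is a stray escape character
# ("free radical") and should itself be percent-encoded as '%25'.
_FREE_PERCENT = re.compile(r'%(?![0-9A-Fa-f]{2})')

def quote_free_percents(text):
    return _FREE_PERCENT.sub(u'%25', text)
```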

Rooted flag still causing serialization oddities

I thought this was the same bug as #90, but @glyph's #91 didn't fix it, so I guess it's at least slightly distinct.

_________ TestURL.test_reproduce_my_rooted_oddity _________

self = <hyperlink.test.test_url.TestURL testMethod=test_reproduce_my_rooted_oddity>

    def test_reproduce_my_rooted_oddity(self):
        a = URL(scheme='udp', port=4900)
        b = URL.from_text('udp://:4900')
        assert str(a) == str(b)
        assert a.asText() == b.asText()
>       assert a == b
E       AssertionError: assert URL.from_text('udp://:4900') == URL.from_text('udp://:4900')
E         -URL.from_text('udp://:4900')
E         +URL.from_text('udp://:4900')

src/hyperlink/test/test_url.py:1101: AssertionError

That assertion error is entertainingly befuddling 😃

After a cursory inspection of a and b there, the only differences I found in the instance __dict__ were _rooted and _uses_netloc.

The same failure happens with URL(scheme='udp', host='', port=4900), but doing URL(scheme='udp', port=4900, rooted=True) (with or without host) makes the test pass.

(this was tested on 688233a)

use a dictionary for query params

hyperlink does not support dictionaries for query. while i know that technically this is an orderedmultidict, this is just not nice from an api perspective:

>>> url = URL.from_text('https://example.org/api/v2')

>>> url.replace(query={'foo': 'bar'})
[...]
ValueError: too many values to unpack (expected 2)

instead, this works:

>>> url.replace(query=[('foo', 'bar')])
URL.from_text('https://example.org/api/v2?foo=bar')

or using .items():

>>> url.replace(query={'foo': 'bar'}.items())
URL.from_text('https://example.org/api/v2?foo=bar')

but tbh none of this looks appealing to me. :)

any reason for the current behaviour?
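In the meantime, a tiny adapter smooths this over — my helper, not a hyperlink API; usage would be url.replace(query=as_query_pairs({'foo': 'bar'})):

```python
def as_query_pairs(query):
    # Accept a mapping or an iterable of (key, value) pairs and return
    # the tuple-of-pairs form that URL.replace(query=...) expects.
    if hasattr(query, 'items'):
        query = query.items()
    return tuple((k, v) for k, v in query)
```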

Example does not work with python 2.7

I am trying to execute the example given in the README.md. With Python 3.5 it works just fine; with Python 2.7 it fails with the following error. Did I miss something?

Python 2.7.13 (default, Jan 12 2017, 13:55:14) 
[GCC 6.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from hyperlink import URL
>>> url = URL.from_text('http://github.com/mahmoud/hyperlink?utm_source=README')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/hyperlink/_url.py", line 971, in from_text
    um = _URL_RE.match(_textcheck('text', text))
  File "/usr/lib/python2.7/site-packages/hyperlink/_url.py", line 430, in _textcheck
    raise TypeError('expected %s for %s, got %r' % (exp, name, value))
TypeError: expected unicode for text, got 'http://github.com/mahmoud/hyperlink?utm_source=README'
>>> 

why is str() the same as repr()?

>>> url = URL.from_text('https://example.org/api/v2')

>>> str(url)
"URL.from_text('https://example.org/api/v2')"

>>> repr(url)
"URL.from_text('https://example.org/api/v2')"

why is str() not just the url?

https://example.org/api/v2

the requests library "stringifies" urls (by design). in practice that means that for instance furl instances can be passed directly. this is not possible with hyperlink:

>>> from hyperlink import URL
>>> import requests

>>> url = URL.from_text('https://example.org/api/v2')
>>> requests.get(url)
[...]
InvalidSchema: No connection adapters were found for 'URL.from_text('https://example.org/api/v2')'
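Until str() changes, the explicit conversion works: requests.get(url.to_text()). A hypothetical wrapper illustrating what furl-style interop requires — nothing more than __str__ returning the URL text:

```python
class TextURL(object):
    # Hypothetical wrapper: __str__ returns the URL text, so libraries
    # that stringify their arguments (like requests) accept it directly.
    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text
```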

DecodedURL.click() only permits unicode or EncodedURL argument

Calling DecodedURL.click(url) raises a TypeError for DecodedURL arguments. A solution to this might be to check for DecodedURL and pass url._url into the wrapped click method, or to define a common base class for the two URL types and type-check that. I'm not sure whether interchanging between EncodedURL and DecodedURL is valid semantically, however.

to_unicode() is harmful

I've had some version of this bug more than once:

>>> from hyperlink import URL
>>> URL.from_text(b"/foo")
URL.from_text("b'/foo'")

Note the bytes were interpreted as "b'/foo'" instead of the (almost certainly) more-expected "/foo".

This is because URL.from_text(s) calls to_unicode(s) which calls unicode(s).

This "works" in Python 2 and does the above in Python 3.

It seems to me that from_text should raise an exception here, insofar as from_text should only ever be given text.
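The suggested strictness is a one-line check — a sketch for Python 3; hyperlink's real _textcheck differs:

```python
def require_text(name, value):
    # Reject bytes (and anything else that isn't text) instead of
    # silently calling str() on it, which mangles b"/foo" into "b'/foo'".
    if not isinstance(value, str):
        raise TypeError('expected str for %s, got %r' % (name, value))
    return value
```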

urls will break on 'narrow' python 2.7-3.3 builds

Found this package via the twisted mailing list, was surprised to see a nexus of glyph, mahmoud and cory so had to peek...

I just found/fixed an issue in a another python package, tldextract, and this one is afflicted by the same issue as well.

I'll link to my bug report and fix for the other project below (which i documented a lot and reference PEPs to), but here's the TLDR:

  • Python 2.7 through 3.2 store unicode data as UCS-2 or UCS-4; it's a compile-time option, and UCS-2 is the default. Python 3.3+ stores unicode data differently (PEP 393). A lot of Linux/Mac distributions shipped with the UCS-2 compile option.

  • UCS-2 is considered a "narrow" build and has a max character range of 65535

    import sys
    print sys.maxunicode

  • many punycode encodings will break during unicode decoding...

    import hyperlink
    url = 'https://xn--vi8hiv.ws'
    obj = hyperlink.URL.from_text(url)
    print obj.to_iri()
    ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Catch that error? It's a ValueError, not a UnicodeEncodeError.

Fixing that is pretty easy... https://github.com/mahmoud/hyperlink/blob/master/hyperlink/_url.py#L1012-L1015

     try:
         asciiHost = self.host.encode("ascii")
     except UnicodeEncodeError:
         textHost = self.host
+    except ValueError:
+        textHost = self.host

but it may make sense to add some more tests with a broken domain (like the f*&^#*&^s link emoji url above, which has broken my indexer an ungodly number of times)

everything is explained in detail here:
john-kurkowski/tldextract#122

and here:
https://github.com/john-kurkowski/tldextract/pull/130/files

what if my password has a reserved delimiter in it?

Right now if I do this:

>>> from hyperlink import URL
>>> example = URL.fromText("https://example.com/")
>>> example.replace(userinfo="alpha:my#password")

I get this:

Traceback (most recent call last):
  File "<input>", line 1, in <module>
    example.replace(userinfo="alpha:my#password")
  File "/Users/glyph/.virtualenvs/tmp-a0b3197f7a1da77/lib/python3.6/site-package
s/hyperlink/_url.py", line 861, in replace
    userinfo=_optional(userinfo, self.userinfo),
  File "/Users/glyph/.virtualenvs/tmp-a0b3197f7a1da77/lib/python3.6/site-package
s/hyperlink/_url.py", line 615, in __init__
    self._userinfo = _textcheck("userinfo", userinfo, '/?#@')
  File "/Users/glyph/.virtualenvs/tmp-a0b3197f7a1da77/lib/python3.6/site-package
s/hyperlink/_url.py", line 410, in _textcheck
    % (''.join(delims), name, value))
ValueError: one or more reserved delimiters /?#@ present in userinfo: 'alpha:my#
password'

The API for setting a secret is sort of silly already (manually sticking a colon into the userinfo string), so "fixing" this might not involve any behavior change, but rather add a new argument (.replace(secret=...?)) but there should be some way to take a string that a user typed into a password box and embed it into the URL somehow without forcing the caller to do any wacky percent encoding of their own.

remove "parse_host" API

What is the purpose of this API being public? It doesn't seem to connect with anything else in the docs.

URL takes any type as input

URL accepts any type as input. This is often wrong, most notably with bytes in Python 3:

>>> URL.fromText(b"/foo").asText()
"b'/foo'"
>>> URL.fromText(object()).asText()
'%3Cobject%20object%20at%200x100538160%3E'

Raising TypeError seems more useful in these cases.

idiomatic way to extend paths

it seems hyperlink does not have a nice way to add path components to a base url, which is a very common operation when interacting with rest apis.

for example, consider

>>> from hyperlink import URL
>>> base_url = URL.from_text('https://example.org/api/v2')

let's assume i want to build a url for an endpoint below this base url, e.g. users/search.

this works:

>>> base_url.replace(path=base_url.path + ('users', 'search'))
URL.from_text('https://example.org/api/v2/users/search')

... but let's face it, this is not so nice:

  • users/search (likely copied from some documentation) needs manual splitting into a tuple
  • base_url is referenced twice since the .path tuple is required for the concatenation

making this nicer is not really possible with the current api. for example:

>>> base_url.replace(path=base_url.path + 'users/search'.split('/'))
[...]
TypeError: can only concatenate tuple (not "list") to tuple

oops. well, that's easy to work around:

>>> base_url.replace(path=base_url.path + tuple('users/search'.split('/')))
URL.from_text('https://example.org/api/v2/users/search')

...but the end result is even uglier.

it would be great if a use case like this is handled with a nicer api. thoughts?
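One stopgap, assuming only the public pieces shown above, is a helper that does the splitting and tuple conversion (the name is mine; usage would be base_url.replace(path=extend_path(base_url.path, u'users/search'))):

```python
def extend_path(path, suffix):
    # Split 'users/search' into segments and append them to the existing
    # path tuple, skipping empty segments from stray slashes.
    return tuple(path) + tuple(seg for seg in suffix.split(u'/') if seg)
```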

URLs don't support fromText -> toURI with URLs containing IPv6 literals

>>> URL.fromText(u"http://[3fff::1]/foo").asURI().asText()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/hawkowl/venvs/commands/lib/python2.7/site-packages/hyperlink/_url.py", line 1338, in to_uri
    new_host = self.host if not self.host else idna_encode(self.host, uts46=True).decode("ascii")
  File "/home/hawkowl/venvs/commands/lib/python2.7/site-packages/idna/core.py", line 340, in encode
    s = uts46_remap(s, std3_rules, transitional)
  File "/home/hawkowl/venvs/commands/lib/python2.7/site-packages/idna/core.py", line 332, in uts46_remap
    _unot(code_point), pos + 1, repr(domain)))
idna.core.InvalidCodepoint: Codepoint U+003A not allowed at position 5 in u'3fff::1'

RFC2397 Data URIs

Percent-encoding isn't limited by utf-8 or any other underlying encoding, and thus can represent pretty much any data. RFC2397 takes advantage of this to jam whatever data you want, along with a mimetype into a URL.

If usage is common enough and the implementation doesn't overcomplicate things, I think this might make sense as a built-in hyperlink feature. The first step is definitely to research how broadly this is used.

reserved characters are treated inconsistently and not sensibly preserved

This has been a design flaw since the inception of the library, so, mea culpa on that.

Fundamentally, preserving, escaping, and encoding "reserved" characters is entirely the URL object's job, and it's failing at that. Possibly the most succinct demonstration of the problem is this:

>>> u = URL()
>>> u = u.child(u'/')
>>> u = u.asIRI()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    u = u.asIRI()
  File
"/Users/glyph/Library/Python/2.7/lib/python/site-packages/hyperlink/_url.py",
line 1116, in to_iri
    fragment=_percent_decode(self.fragment))
  File
"/Users/glyph/Library/Python/2.7/lib/python/site-packages/hyperlink/_url.py",
line 861, in replace
    userinfo=_optional(userinfo, self.userinfo),
  File
"/Users/glyph/Library/Python/2.7/lib/python/site-packages/hyperlink/_url.py",
line 606, in __init__
    for segment in path))
  File
"/Users/glyph/Library/Python/2.7/lib/python/site-packages/hyperlink/_url.py",
line 606, in <genexpr>
    for segment in path))
  File
"/Users/glyph/Library/Python/2.7/lib/python/site-packages/hyperlink/_url.py",
line 410, in _textcheck
    % (''.join(delims), name, value))
ValueError: one or more reserved delimiters /?# present in path segment: u'/'
>>>

This is - obviously I hope - the wrong place to be failing with an error like this.

There was previously some attempt to preserve these characters in the data model and escape them only upon stringification, but d26814c wrecked these semantics. (In fairness: the attempt to do this was broken, and there are some places, like the scheme, where certain characters indeed cannot be represented, so this direction isn't entirely wrong.)

Fundamentally if a user wants to encode slashes, question marks, hash signs or whatever else that a human might, for example, type into a text field, then it should be possible to do that.

We could fix this obvious manifestation of the problem by just putting back the escape-only-on-asText logic, but that still leaves an even more pernicious problem:

>>> u = URL(path=tuple([u'%2525']))
>>> u.asText()
u'%2525'
>>> u.asIRI().asText()
u'%25'
>>> u.asIRI().asIRI().asText()
u'%'
>>> 

Clearly, multiple trips through asIRI should not be un-escaping the escape character: the idea is that .asIRI() is a normalization step that should be idempotent upon subsequent calls.

For the moment, I'm not sure exactly what the correct fix is here, but the property I'd really like to preserve is that for any x,

URL.fromText(URL().child(x).<as many asIRI()s or asURI()s as you want>.asText()).<as many .asIRI()s as you want, although possibly not .asURI()s>.segments[0] == x

Excessive escaping of "=" in query string parameter values

As a more specific continuation of the discussion in #11, it would seem that the = character is yet another special case. While = is a meaningful character in the query string, separating keys and values, only the first = does that.

Digging in further, empty query parameter keys are OK. And equals signs in the value of query parameter values are OK.

# Werkzeug request object for "GET http://localhost:5000/?=x=x=x="
# from Firefox and Chrome
(Pdb) request.args
ImmutableMultiDict([('', u'x=x=x=')])

Seen here, and in their developer tools, Firefox and Chrome do not encode the equals signs. On the server side, Werkzeug is ok with this.

Now, urllib does, but I think this is only because their implementation is lazy. :)

Can twisted.python.url.URL be switched over to hyperlink.DecodedURL by default?

Just wanted to open a discussion. Previously, Twisted's URL didn't mind much if reserved characters were added in values (see #6 and #8). Hyperlink's URL changed that, with all that entailed (see #44). Now, DecodedURL would allow a return to all-characters allowed, with the necessary escaping happening automatically.

Would it make sense for DecodedURL to become Twisted's primary URL? It wasn't designed to, but it's pretty close, with at least one exception (userinfo is a tuple instead of a :-separated string).

@glyph @markrwilliams @wsanchez, thoughts?

Add support for IPv6 Zone Identifiers

Due to hyperlink using socket.inet_pton() to parse IPv6, IPv6 zone identifiers aren't supported.

>>> hyperlink.URL.from_text(u'https://[fe80:3438:7667:5c77:ce27%eth0]')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "hyperlink/_url.py", line 1135, in from_text
    rooted, userinfo, uses_netloc)
  File "hyperlink/_url.py", line 797, in __init__
    _, self._host = parse_host(_textcheck('host', host, '/?#@'))
  File "hyperlink/_url.py", line 697, in parse_host
    raise URLParseError('invalid IPv6 host: %r (%r)' % (host, se))
hyperlink._url.URLParseError: invalid IPv6 host: u'fe80:3438:7667:5c77:ce27%eth0' (error('illegal IP address string passed to inet_pton',))

While inet_pton() may not support them, other socket module functions do. This is probably best addressed by switching to regex-based parsing of hosts, which also happens to address hyperlink's Windows-specific branch (socket module on Windows Py2 doesn't support inet_pton).
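A regex-based host parse could peel the zone identifier off before address validation. A sketch (note RFC 6874 writes the '%' as '%25' inside URLs):

```python
import re

# Match a zone identifier at the end of an IPv6 host, with the '%'
# optionally written as '%25' per RFC 6874.
_ZONE = re.compile(r'%(?:25)?(?P<zone>[^%\]]+)$')

def split_zone(host):
    # Return (address, zone) where zone is None if no identifier present.
    m = _ZONE.search(host)
    if m:
        return host[:m.start()], m.group('zone')
    return host, None
```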

Extend `URL().remove()` to support removing a single key-value pair

I have some URLs that use multiple query parameters to track filters, e.g.:

/documents?filter=apple&filter=banana&filter=coconut

If I parse this with URL(), I can use remove("filter") to remove all the instances of this query parameter, but removing just the "banana" filter is a little more fiddly.

How would you feel about extending remove() to support passing an optional query parameter value to remove, so you only remove that one pair (if it exists)?

Something like:

def remove(self, name, value=None):
    if value is None:
        # remove every pair with this key
        new_query = ((k, v) for (k, v) in self.query if k != name)
    else:
        # remove only pairs matching both the key and the value
        new_query = ((k, v) for (k, v) in self.query
                     if not (k == name and v == value))
    return self.replace(query=new_query)

Add an API for resolving a URL against another

Hi!

url.click appears to take a str. It seems like it would be nice to have an API that takes a URL instead.

(For the same reasons why dealing with URLs and not strs is better in the first place :)

URL().child() fails with a TypeError

python -c 'from hyperlink import URL; URL.from_text(u"http://example.com").child()'

produces

Traceback (most recent call last):
  File "<module>", line 1, in <module>
  File "/Users/Julian/.local/share/virtualenvs/dev/site-packages/hyperlink/_url.py", line 1099, in child
    else None] + new_segs
TypeError: unsupported operand type(s) for +: 'tuple' and 'list'

probably would be nice for that to just return the original URL. If it's intentionally an error though, one that says that the argument is required would be nice.

Document technical design precedence

The URL is a complicated structure with a long history. Not all implementations and standards agree on the finer details.

Per discussion in #38, I need to add a new FAQ or section to the Design document that details the design goals in terms of precedence.

In general terms, the current precedence is RFC3986 (and other non-obsolete RFCs), browser behavior (particularly Firefox and Chrome), Twisted community practices, and somewhere down the line WHATWG. Some explanation of this is in order.

Latest Hyperlink breaks Twisted on Python 3's test suite

[FAIL]
Traceback (most recent call last):
  File "/buildslave/fedora25-py3.5-coverage/Twisted/build/py35-alldeps-withcov-posix/lib/python3.5/site-packages/twisted/python/test/test_url.py", line 750, in test_invalidArguments
    check("scheme")
  File "/buildslave/fedora25-py3.5-coverage/Twisted/build/py35-alldeps-withcov-posix/lib/python3.5/site-packages/twisted/python/test/test_url.py", line 749, in check
    assertRaised(raised, expectation, param)
  File "/buildslave/fedora25-py3.5-coverage/Twisted/build/py35-alldeps-withcov-posix/lib/python3.5/site-packages/twisted/python/test/test_url.py", line 744, in assertRaised
    name, "<unexpected>"))
  File "/buildslave/fedora25-py3.5-coverage/Twisted/build/py35-alldeps-withcov-posix/lib/python3.5/site-packages/twisted/trial/_synctest.py", line 432, in assertEqual
    super(_Assertions, self).assertEqual(first, second, msg)
  File "/usr/lib64/python3.5/unittest/case.py", line 837, in assertEqual
    assertion_func(first, second, msg=msg)
  File "/usr/lib64/python3.5/unittest/case.py", line 1210, in assertMultiLineEqual
    self.fail(self._formatMessage(msg, standardMsg))
twisted.trial.unittest.FailTest: 'expected unicode for scheme, got <unexpected>' != 'expected str for scheme, got <unexpected>'
- expected unicode for scheme, got <unexpected>
?          ^^^^^^^
+ expected str for scheme, got <unexpected>
?          ^^^


twisted.python.test.test_url.TestURL.test_invalidArguments

Related to d26814c changing the exception handling

cc @mahmoud

Add license file to the PyPI tarball

Current tarball doesn't include LICENSE file. It is important for Buildroot distribution in order to gather related legal info of all packages.

When are you going to make a new release?

Contextualize scheme registration API

hyperlink.register_scheme mutates global state. That's convenient but amounts to an import time side effect. If I do this in module a:

from hyperlink import register_scheme
register_scheme("blah")

I have to do this in module b:

from hyperlink import URL
import a

u = URL.from_text("blah://blah.com")

Scheme registration should be localized, returning a new URL-like object that knows about the registered schemes. That would let a look like this:

from hyperlink import URL as _URL

URL = _URL.schemes({"blah": "blah"})

So that b could do this:

from a import URL

u = URL.from_text("blah://blah.com")

A context manager might be useful, too:

with URL.schemes({"blah": "blah"}) as blah_url:
    u = blah_url.from_text("blah://blah.com")

℅ does not encode as a domain name (Python's built-in idna encoding is insufficient)

I'm not entirely sure this is a bug, but it sure seems like one:

>>> from hyperlink import URL
>>> text = u'http://\u2105'
>>> url = URL.fromText(text)
>>> url.asIRI()
URL.from_text('http://℅')
>>> url.asURI()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/wsanchez/Dropbox/Developer/BurningMan/ranger-ims-server/.tox/coverage-py36/lib/python3.6/site-packages/hyperlink/_url.py", line 1070, in to_uri
    fragment=_encode_fragment_part(self.fragment, maximal=True)
  File "/Users/wsanchez/Dropbox/Developer/BurningMan/ranger-ims-server/.tox/coverage-py36/lib/python3.6/site-packages/hyperlink/_url.py", line 861, in replace
    userinfo=_optional(userinfo, self.userinfo),
  File "/Users/wsanchez/Dropbox/Developer/BurningMan/ranger-ims-server/.tox/coverage-py36/lib/python3.6/site-packages/hyperlink/_url.py", line 601, in __init__
    self._host = _textcheck("host", host, '/?#@')
  File "/Users/wsanchez/Dropbox/Developer/BurningMan/ranger-ims-server/.tox/coverage-py36/lib/python3.6/site-packages/hyperlink/_url.py", line 410, in _textcheck
    % (''.join(delims), name, value))
ValueError: one or more reserved delimiters /?#@ present in host: 'c/o'

The cause is probably in the IDNA encoding:

>>> "℅".encode("idna")
b'c/o'

This surprises me, but I might just be ignorant about how this works…?

Decode percent-encoding in mixed text

Right now, _url._percent_decode() has a fast and silent path with some surprising results. You can pass in text with percent encoding present, and get that text back out, unmodified, if there are any non-ASCII characters present.

>>> _percent_decode(u'é%3Dmc^2')
u'é%3Dmc^2'

This poses an obvious problem to decoding IRI values containing reserved characters, as is the case for DecodedURL. #54 worked around this by re-percent-encoding everything before percent decoding it. Aside from being a bit hacky, there are more efficient ways of approaching this.
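A decoder that handles escape runs independently of the surrounding non-ASCII text might look like this sketch (assuming UTF-8 for the escaped bytes, which is the common but not guaranteed case):

```python
import re

# One or more consecutive %XX escapes.
_PCT_RUN = re.compile(r'(?:%[0-9A-Fa-f]{2})+')

def percent_decode_mixed(text):
    # Decode each run of %XX escapes as UTF-8 bytes, leaving the
    # surrounding (possibly non-ASCII) text untouched.
    def _decode(match):
        run = match.group(0)
        raw = bytes(int(run[i + 1:i + 3], 16) for i in range(0, len(run), 3))
        return raw.decode('utf-8', 'replace')
    return _PCT_RUN.sub(_decode, text)
```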
