john-kurkowski / tldextract
Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
License: BSD 3-Clause "New" or "Revised" License
extract should strip trailing whitespace from the text values passed to it, e.g.:
>>> tldextract.extract('ads.adiquity.com ')
ExtractResult(subdomain='ads.adiquity', domain='com ', suffix='')
>>> tldextract.extract('ads.adiquity.com '.strip())
ExtractResult(subdomain='ads', domain='adiquity', suffix='com')
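The failure is easy to see in isolation: suffix membership checks are exact string matches, so a trailing space makes the last label unmatchable. A minimal stdlib-only illustration (the one-entry suffix set is mine, standing in for the PSL):

```python
# Illustrative only: a one-entry stand-in for the Public Suffix List.
SUFFIXES = {'com'}

def last_label(hostname):
    # Split on dots; the final label is what gets tested against the PSL.
    return hostname.split('.')[-1]

# 'com ' (with a trailing space) is not equal to 'com', so the lookup misses.
assert last_label('ads.adiquity.com ') not in SUFFIXES
# Stripping the input first restores the expected match.
assert last_label('ads.adiquity.com '.strip()) in SUFFIXES
```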
Hi,
thanks for the lib, works great!
but I want to use it on my GAE server.
Is there a nice way to load the TLD cache file other than by reading a file, since there is no real path to it?
Thanks!
In [2]: tldextract.extract(u'1\xe9')
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-2-11b8ca5967f6> in <module>()
----> 1 tldextract.extract(u'1\xe9')
/home/chungwu/.virtualenvs/pod/local/lib/python2.7/site-packages/tldextract/tldextract.pyc in extract(url)
291 @wraps(TLD_EXTRACTOR.__call__)
292 def extract(url):
--> 293 return TLD_EXTRACTOR(url)
294
295 @wraps(TLD_EXTRACTOR.update)
/home/chungwu/.virtualenvs/pod/local/lib/python2.7/site-packages/tldextract/tldextract.pyc in __call__(self, url)
216 if not tld and netloc and netloc[0].isdigit():
217 try:
--> 218 is_ip = socket.inet_aton(netloc)
219 return ExtractResult('', netloc, '')
220 except AttributeError:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
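One workaround (a sketch of a caller-side pre-processing step, not the library's own fix; the helper name is mine) is to IDNA-encode the hostname before it reaches tldextract, so only ASCII ever hits socket.inet_aton:

```python
def to_ascii_hostname(hostname):
    # The stdlib 'idna' codec converts each Unicode label to its
    # ASCII 'xn--' form (RFC 3490), label by label.
    return hostname.encode('idna').decode('ascii')

# A hostname with an accented character becomes pure ASCII:
# to_ascii_hostname('café.com') -> 'xn--caf-dma.com'
```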
Since Python 3, all strings are unicode, so the u'...' notation no longer makes sense.
One can't install your package on a Python 3 installation because of the return statement of this function:
def fetch_file(urls):
    """Decode the first successfully fetched URL, from UTF-8 encoding to
    Python unicode.
    """
    s = ''
    for url in urls:
        try:
            conn = urlopen(url)
            s = conn.read()
        except Exception as e:
            LOG.error('Exception reading Public Suffix List url ' + url + ' - ' + str(e) + '.')
        else:
            return _decode_utf8(s)
    LOG.error('No Public Suffix List found. Consider using a mirror or constructing your TLDExtract with `fetch=False`.')
    return u''
This module doesn't work for me today since PUBLIC_SUFFIX_LIST_URL is not accessible.
https://raw.github.com/mozilla/mozilla-central/master/netwerk/dns/effective_tld_names.dat
Here at Veracode we've made use of this wonderful library for quite some time now. It's great.
We'd like to be able to control the Public Suffix List input data more tightly. Namely, we don't just want to set an expiration, but to actually version-control the input data ourselves, and be able to specify a different version of the Public Suffix List file ourselves, at runtime.
This could be a keyword argument like suffix_list_file
. The value would be the path to a file exactly like the file that tldextract downloads. In other words, any plain-text file that is UTF-8 encoded. It should be newline-separated and conform to the syntax specified on the Mozilla Public Suffix List website, of course, but that's really up to the client to be correct about.
I am not 100% sure of what the relationship of this file would be to the cache_file
and fetch
arguments. It seems like the simplest way would be for this file to be a drop-in replacement for the downloaded Suffix List file. (i.e. read in this file if cache_file
does not exist.) TBH, I'm not sure how to force TLDExtract to load in the data anew regardless of whether cache_file
exists, at this point, so I don't know how this new feature would relate to that behavior.
I will probably fork and implement this.
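A sketch of what the proposed suffix_list_file could amount to once read (the keyword is the issue's proposal and the helper name is mine, not the library's API): parse a UTF-8, newline-separated, PSL-syntax file into suffix rules, skipping comments and blank lines.

```python
def load_suffix_list(path):
    # PSL files are UTF-8 text; lines starting with '//' are comments.
    with open(path, encoding='utf-8') as f:
        return [line.strip() for line in f
                if line.strip() and not line.lstrip().startswith('//')]
```

Whether this replaces or merely seeds the cache_file would still need deciding, as discussed above.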
>>> print tldextract.extract(' http://example.com/foo?bla=1#baz ')
ExtractResult(subdomain='', domain=' http', suffix='')
What am I doing wrong?
https://github.com/john-kurkowski/tldextract/blob/master/tldextract/tldextract.py#L344
the pkg_resources shim does not define get_distribution(), so tldextract.main() will fail if setuptools is not installed.
https://github.com/john-kurkowski/tldextract/blob/master/tldextract/tldextract.py#L44
Line 174:
print >> sys.stderr, line
uses the sys module, which is never imported.
Love tldextract, thank you. But I just wanted to suggest something... tld in the ExtractResult is actually representative of a suffix, not a TLD, as (a) this data comes from the Public Suffix List, and (b) assuming extract is an instance of TLDExtract, extract('foo.co.uk').tld == "co.uk". In that case "uk" is actually the TLD, a country-code TLD to be precise, and "co" is just conventional... So yes, it's good for them to be part of the same information, as we don't want to be told that the domain is "co".
I think this misleading term could easily be resolved by changing the name of the attribute to suffix. Then just make a property that returns it (and doesn't get printed in repr), called tld, for backwards compatibility.
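The suggestion above can be sketched with a plain namedtuple subclass (this is the proposal's shape, not the library's actual code): the stored field becomes suffix, and tld survives as a property, which namedtuple's repr will not print.

```python
from collections import namedtuple

class ExtractResult(namedtuple('ExtractResult',
                               ['subdomain', 'domain', 'suffix'])):
    __slots__ = ()

    @property
    def tld(self):
        # Backwards-compatible alias for the renamed field.
        return self.suffix
```

Because tld is a property rather than a field, repr(ExtractResult('foo', 'example', 'co.uk')) still shows only the three stored fields.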
tld.extract('http://pajamasam.blogspot.com')
=> ExtractResult(subdomain='', domain='pajamasam', suffix='blogspot.com')
tld.extract('http://pajamasam.blogspot.ru')
=> ExtractResult(subdomain='pajamasam', domain='blogspot', suffix='ru')
File "/www/sites/virtualenv/lib/python3.4/site-packages/tldextract/tldextract.py", line 267, in _get_tld_extractor
self._cache_tlds(tlds)
File "/www/sites/virtualenv/lib/python3.4/site-packages/tldextract/tldextract.py", line 308, in _cache_tlds
LOG.info("computed TLDs: [%s, ...]", ', '.join(list(tlds)[:10]))
This throws a bunch of UnicodeEncodeErrors on the latest pip version. Any idea what that is all about?
Thanks!
This is a non-blocking error that I can capture only if I import the logging module.
manuel@Manuel-NG:~>python
Python 2.7.2 (default, Oct 4 2011, 14:55:10)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
...
import tldextract
import logging
logging.basicConfig()
one, two, three = tldextract.extract('forums.bbc.co.uk/nano/micro.html')
ERROR:/Library/Python/2.7/site-packages/tldextract-1.0-py2.7.egg/tldextract/tldextract.pyc:error reading TLD cache file /Library/Python/2.7/site-packages/tldextract-1.0-py2.7.egg/tldextract/.tld_set: [Errno 2] No such file or directory: '/Library/Python/2.7/site-packages/tldextract-1.0-py2.7.egg/tldextract/.tld_set'
WARNING:/Library/Python/2.7/site-packages/tldextract-1.0-py2.7.egg/tldextract/tldextract.pyc:unable to cache TLDs in file /Library/Python/2.7/site-packages/tldextract-1.0-py2.7.egg/tldextract/.tld_set: [Errno 13] Permission denied: '/Library/Python/2.7/site-packages/tldextract-1.0-py2.7.egg/tldextract/.tld_set'
touch'ed and chown'ed .tld_set with no success.
chown'ed the folder, and everything worked fine. Something to review in the install process under OS X? (mine is Lion 10.7.3)
Great library! Exactly what I was looking for.
One small request: I'd like to have a little more control over the caching of the public suffix list. We run jobs across a lot of machines, so each machine will get its own cached version of the list. I'd rather not have to manage the cached list on each machine. I think adding a cache expiration parameter would be a huge help with this.
Also, allowing the user to specify the URL for the public suffix list would be great, too. I'd like to host our own cache of the list, so we're not dependent on mozilla's uptime and bandwidth.
I think these two changes would add a lot of flexibility.
If you'd prefer, I can fork a branch and send a pull request with the changes.
The CSIRO domain is often misunderstood, as it is the only domain name issued to a specific organisation directly within the .au ccTLD (i.e. csiro.au), instead of via some other second-level suffix (e.g. csiro.org.au).
The doctests below should pass, but don't.
#!/usr/bin/env python
"""
The csiro.au domain is a domain in its own right.

>>> extract('csiro.au')
ExtractResult(subdomain='', domain='csiro', tld='au')
>>> extract('www.csiro.au')
ExtractResult(subdomain='www', domain='csiro', tld='au')
>>> extract('research.ict.csiro.au')
ExtractResult(subdomain='research.ict', domain='csiro', tld='au')
"""
from tldextract import extract

if __name__ == "__main__":
    import doctest
    doctest.testmod()
The error is
error reading TLD cache file /usr/home/jaime/venv-qt4-p34dm/lib/python3.4/site-packages/tldextract/.tld_set: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
after doing the first 'extract' in a session.
It's fixed by opening self.cache_file (in _get_tld_extractor) in mode 'rb':
try:
    with open(self.cache_file, 'rb') as f:
        self._extractor = _PublicSuffixListTLDExtractor(pickle.load(f))
I'm running python 3.4 on a virtual environment, as can be deduced from the path.
Could it please be modified, so calling 'extract' on a fresh 'pip install' setup doesn't complain? I'm planning to use the package in a project. Tell me if more information is required, or if a pull request is needed (this looks trivial, but I may be wrong)
The import statement in line 35 of tldextract.py causes an unsuppressed UserWarning in Python 2.6.
An internal server error (500) status is returned when visiting the example URL from the readme file:
http://tldextract.appspot.com/api/extract?url=http://www.bbc.co.uk/foo/bar/baz.html
Would it be possible to add a "fix tld" type feature? I have a list of emails, some of which are not complete, such as "[email protected]" instead of "[email protected]"; it would be awesome to figure out a way to recognize state.tx as partially complete.
I know this might be tricky, since state.tx isn't actually a TLD. Any thoughts?
Hello,
I'm using Python 2.7.4 and get this error:
import tldextract
tldextract.extract('http://forums.news.cnn.com/')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/andres/prueba/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 239, in extract
return TLD_EXTRACTOR(url)
File "/home/andres/prueba/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 167, in __call__
registered_domain, tld = self._get_tld_extractor().extract(netloc)
File "/home/andres/prueba/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 207, in _get_tld_extractor
tlds = frozenset(tld for tld_source in tld_sources for tld in tld_source())
File "/home/andres/prueba/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 207, in <genexpr>
tlds = frozenset(tld for tld_source in tld_sources for tld in tld_source())
File "/home/andres/prueba/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 253, in _PublicSuffixListSource
page = _fetch_page('http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1')
File "/home/andres/prueba/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 247, in _fetch_page
return unicode(urlopen(url).read(), 'utf-8')
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1187, in do_open
r = h.getresponse(buffering=True)
File "/usr/lib/python2.7/httplib.py", line 1045, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 373, in _read_status
raise BadStatusLine(line)
httplib.BadStatusLine: ''
An example:
extract("HTTP://NYTIMES.COM/")
--> ExtractResult(subdomain='NYTIMES', domain='COM', tld='')
extract("http://nytimes.com/")
--> ExtractResult(subdomain='', domain='nytimes', tld='com')
Correct me if I'm wrong, but it seems that uk.com and us.com are top-level domains as well?
I often find myself doing '.'.join(tldextract.extract(url)[1:]), which seems kind of hard to remember for something that seems like it would be widely used. Basically an alias to return domain + tld separated by a dot.
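The requested alias could be as small as this (the helper name is hypothetical; it operates on any (subdomain, domain, suffix)-style triple such as extract's result):

```python
def registered_domain(parts):
    # Join domain and suffix with a dot, skipping empty pieces so a
    # suffix-less result like ('', 'localhost', '') stays clean.
    _subdomain, domain, suffix = parts
    return '.'.join(p for p in (domain, suffix) if p)
```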
It seems that TLDExtract does not lower() its inputs before testing. So if you input foo.CoM, it's not going to extract the suffix com. Case should be lowered because the Public Suffix List is all lowercase, and because DNS is case-insensitive (RFC 4343).
>>> from veracode.utils.data.formatter import extract
>>> extract('example.COM')
ExtractResult(subdomain='example', domain='COM', tld='')
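The reason lowering matters is that suffix lookup is an exact match against a lowercase list; a minimal stand-alone sketch (the two-entry suffix set is illustrative, not the real PSL):

```python
SUFFIXES = {'com', 'co.uk'}

def longest_suffix(hostname):
    # Lowercase first, then test progressively shorter label tails
    # against the (lowercase) suffix set; the first hit is the longest.
    labels = hostname.lower().split('.')
    for i in range(len(labels)):
        candidate = '.'.join(labels[i:])
        if candidate in SUFFIXES:
            return candidate
    return ''
```

Without the .lower() call, 'foo.CoM' would match nothing, which is exactly the reported behavior.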
Can you add support for punycode encoded TLDs?
>>> tldextract.extract("xn--80aaaaimf2bemuibxoc5d.xn--p1ai")
ExtractResult(subdomain='xn--80aaaaimf2bemuibxoc5d', domain='xn--p1ai', suffix='')
Should be
>>> tldextract.extract("xn--80aaaaimf2bemuibxoc5d.xn--p1ai")
ExtractResult(subdomain='', domain='xn--80aaaaimf2bemuibxoc5d', suffix='xn--p1ai')
https://www.mozilla.org/en-US/about/governance/policies/security-group/tld-idn/
If I run tldextract --update to update the domains file then, because of this line https://github.com/john-kurkowski/tldextract/blob/1.5.1/tldextract/tldextract.py#L264, the private domains are not downloaded.
Therefore, if I later create a TLDExtract(include_psl_private_domains=True) instance, the private domains are not mapped.
I recommend always downloading the full list and replacing line 264 with:
TLD_EXTRACTOR = TLDExtract(include_psl_private_domains=True)
When using tldextract on Python 3.4 I noticed that the __dict__ attribute of the namedtuple ExtractResult would always be empty, which caused its _asdict method to also return an empty dict. This is because ExtractResult does not define a __slots__ member, which prevents the parent class's __dict__ from being added. I am not sure which version of Python changed this behavior. I added a unit test (and also added py34 to the tox.ini environments) for this issue and it passed for Python 2.7 but failed for Python 3.4; those are the only versions I had installed. I fixed the issue in tldextract.py and all tests now pass, in both 2.7 and 3.4. I created pull request #69 with the fix and additional test.
I have a GAE app instance and use the tldextract library. Everything works on my localhost, however when i push to the GAE app everything breaks and it complains that line: 311 of 'tldextract.py' cannot find file '.tld_set_snapshot'
snapshot_stream = pkg_resources.resource_stream(name, '.tld_set_snapshot')
The tldextract exists in /external/ and the tld_set_snapshot file is resides in \external\tldextract\
This is the stacktrace below:
File "/base/data/home/apps/s~cmat---/1.389467096411176514/lib/external/tldextract/tldextract.py", line 343, in extract
  return TLD_EXTRACTOR(url)
File "/base/data/home/apps/s~cmat---/1.389467096411176514/lib/external/tldextract/tldextract.py", line 214, in __call__
  suffix_index = self._get_tld_extractor().suffix_index(translations)
File "/base/data/home/apps/s~cmat---/1.389467096411176514/lib/external/tldextract/tldextract.py", line 267, in _get_tld_extractor
  self._cache_tlds(tlds)
File "/base/data/home/apps/s~cmat---/1.389467096411176514/lib/external/tldextract/tldextract.py", line 311, in _cache_tlds
  snapshot_stream = pkg_resources.resource_stream(name, '.tld_set_snapshot')
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/setuptools-0.6c11/pkg_resources.py", line 888, in resource_stream
  self, resource_name
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/setuptools-0.6c11/pkg_resources.py", line 1285, in get_resource_stream
  return open(self._fn(self.module_path, resource_name), 'rb')
IOError: [Errno 2] No such file or directory: '/base/data/home/apps/s~cmat---/1.389467096411176514/lib/external/tldextract/.tld_set_snapshot'
I couldn't properly reproduce the error locally, so I don't know exactly under which circumstances this happens, but you can check the stacktrace here.
I've solved this issue by encoding the line in UTF-8 before printing it. I'll submit a pull request with that change.
I created a new virtualenv, installed tldextract 0.3 with pip, ran our test suite, and received the following stack trace:
File "/Envs/myenv/lib/python2.6/site-packages/tldextract/tldextract.py", line 96, in extract
  return _extract(netloc, fetch)
File "/Envs/myenv/lib/python2.6/site-packages/tldextract/tldextract.py", line 114, in _extract
  registered_domain, tld = _get_tld_extractor(fetch).extract(netloc)
File "/Envs/myenv/lib/python2.6/site-packages/tldextract/tldextract.py", line 158, in _get_tld_extractor
  with open(snapshot_file) as f:
IOError: [Errno 2] No such file or directory: '/Envs/myenv/lib/python2.6/site-packages/tldextract/.tld_set_snapshot'
The issue is fixed when I run python -m tldextract.tldextract http://forums.bbc.co.uk. This is a problem for us because we are unable to run such a command on each machine after deploying to our cluster.
Repro
TLDExtract(include_psl_private_domains=True).update()
TLDExtract(include_psl_private_domains=False).extract('foo.appspot.com')
Expected: foo appspot com
Actual: foo.appspot com
The opposite order is also broken.
Related to #64.
ext = tldextract.extract('http://www.google.com#dfdfdf')
ext
ExtractResult(subdomain='www.google', domain='com#dfdfdf', tld='')
For some reason parsing a subhost on blogspot results in a messed-up parse. The domain bit returns the subdomain, whereas the subdomain returns an empty string.
>>> ext = tldextract.extract("http://dadada.blogspot.com")
>>> print ext.domain
dadada
>>> print ext.subdomain

Basically a request to also include the path on extract?
Websites such as:
http://www.tfl.gov.uk
http://www.nhsdirect.nhs.uk
This leads to them being separated incorrectly in some circumstances.
Hi,
it seems that stderr debug logging is enabled by default? I have a scrapy project that uses tldextract once, and upon import it spams the log with 2000 debug lines, starting with:
[tldextract] computed TLDs: [chikuho.fukuoka.jp, భారత్, 公司.hk, pvt.ge, matsuzaki.shizuoka.jp, name.eg, tsuruga.fukui.jp, in-addr.arpa, hisayama.fukuoka.jp, name.et, ...]
[stderr] --- .tld_set_snapshot
How I fixed it:
# disable tldextract stderr spam
logger = logging.getLogger('tldextract')
logger.setLevel('WARNING')
If this is not enabled by default, could it be that tldextract grabs some other logger's level by default? I'm not too familiar with the logging module, so I'm not sure of the cause; however, spamming stderr upon import is something that shouldn't happen.
.tld_set_snapshot and .tld_set use pickle to store the TLD information. While this is perfectly fine in most cases, it brings the following issues:
I am having problems with
https://github.com/john-kurkowski/tldextract/blob/master/tldextract/tldextract.py#L182
and doing
## We need to monkey patch in the new Python IO
## because of http://www.freebsd.org/cgi/query-pr.cgi?pr=148581
## when using the kqueue reactor and twistd.
##
import io, __builtin__
__builtin__.open = io.open
File "c:\python27\lib\site-packages\tldextract-1.1-py2.7.egg\tldextract\tldextract.py", line 187, in _get_tld_extractor
pickle.dump(tlds, f)
exceptions.TypeError: must be unicode, not str
The issue can be solved by opening the file in binary mode:
with open(cached_file, 'wb') as f:
There are multiple uses of the Mozilla public suffix list which allow sites such as "appspot.com" to appear on the list as a tld instead of being split into domain="appspot" and tld="com".
This is perfectly reasonable behavior for some use cases, but for others it would be helpful to have the "private" domains be excluded. Mozilla has split the list into "ICANN Domains" and "Private Domains", and it would be useful to optionally be able to exclude the private domains so that sites like "appspot.com" would have their tld reflected as "com".
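Mechanically, the split is visible in the PSL file itself, which brackets the private section with marker comments; a sketch of filtering that section out (the helper name is mine, and the sample lines in the test are abbreviated):

```python
def icann_suffixes(psl_lines):
    # The real PSL delimits its private section with these exact markers,
    # themselves written as '//' comment lines.
    suffixes, in_private = [], False
    for line in psl_lines:
        if '===BEGIN PRIVATE DOMAINS===' in line:
            in_private = True
        elif '===END PRIVATE DOMAINS===' in line:
            in_private = False
        elif line.strip() and not line.startswith('//') and not in_private:
            suffixes.append(line.strip())
    return suffixes
```

With the private section dropped, "appspot.com" no longer appears as a suffix, so its tld would be reported as "com".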
It looks like Mozilla may be blocking HTTP access to the effective_tld_names.dat file:
http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
It's been down for at least 24 hours, but the rest of mxr.mozilla.org is alive which makes me think the URL is deliberately blocked.
Here is a GitHub mirror of the file:
https://raw.github.com/mozilla/mozilla-central/master/netwerk/dns/effective_tld_names.dat
In [1]: import tldextract
In [2]: url='twitter.github.io'
In [3]: tldextract.extract(url)
Out[3]: ExtractResult(subdomain='', domain='twitter', suffix='github.io')
I'll try to create a pull request :)
Given a URL like:
xx.yyy.domain.com/path_me_somewhere.html
it would be nice if tldextract could also return the path (and other components) of the url.
for example, instead of this code
url_parts = urlparse.urlparse(some_funny_url)
url = tldextract.extract("%s://%s" % (url_parts.scheme, url_parts.netloc))
url = "%s://%s.%s%s" % (url_parts.scheme, url.domain, url.tld, url_parts.path)
it could be as simple as this:
url = tldextract.extract(some_funny_url)
url = "%s://%s.%s%s" % (url.scheme, url.domain, url.tld, url.path)
what do you think?
The PyPI trove classifiers seem to indicate that this package is BSD-licensed, but there's no explicit license statement in the source. It would make it easier for me to use tldextract if the licensing were made clearer.
The argument "suffix_list_url" seems to have been replaced with "fetch".
Thanks!
Hey,
The regex used by tldextract fails to detect ".com.au" properly, amongst many others on this page:
https://wiki.mozilla.org/TLD_List
Might be worth updating to match this list? I would do it, but I don't have any unit tests. :(
Cal
for example:
"xn--h1alffa9f.xn--p1ai" results in
ExtractResult(subdomain='xn--h1alffa9f', domain='xn--p1ai', suffix='')
Workaround:
def tldextract_with_puny(url):
    if '.xn--' in url:  # punycode domains
        o = urlparse.urlparse(url)
        url = o.netloc.decode('idna')
    ext = tldextract.extract(url)
    return ext
Empty paths are not handled correctly:
tldextract.extract('www.whatever.com?param=1')
ExtractResult(subdomain='www.whatever', domain='com?param=1', tld='')
That should give the same result as:
tldextract.extract('www.whatever.com/?param=1')
Regards,
Stephane.
>>> tldextract.extract("http://www.blogspot.de/")
ExtractResult(subdomain='', domain='www', tld='blogspot.de')
Any chance you'll add more control over list file caching, e.g. passing in a PSL cache file to a constructor, or specifying another location to store the cache? If the user running a script using tldextract doesn't have write access to /path/to/tldextract/.tld_set and it doesn't exist as yet, this can cause a problem.
Additionally, any thoughts about being able to specify a time interval to keep the cache, and refresh if it's older than that? I like the caching flexibility in publicsuffix (http://pypi.python.org/pypi/publicsuffix/1.0.0), but I prefer the way you do matching and return a named tuple.
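The expiry part of the request could be as simple as an mtime check before trusting the cached list (the function and parameter names here are hypothetical, not the library's API):

```python
import os
import time

def cache_is_fresh(cache_path, max_age_seconds=7 * 24 * 3600):
    # Stale and missing caches both signal that a re-fetch is needed.
    try:
        age = time.time() - os.path.getmtime(cache_path)
    except OSError:
        return False  # cache file missing
    return age < max_age_seconds
```

The constructor would then re-download the suffix list whenever cache_is_fresh returns False, instead of only when the file is absent.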
pkg_resources
isn't available on GAE
<type 'exceptions.ImportError'>: No module named pkg_resources
Traceback (most recent call last):
File "/base/data/home/apps/tldextract/1.353646899194797005/handlers.py", line 2, in <module>
import tldextract
File "/base/data/home/apps/tldextract/1.353646899194797005/tldextract/__init__.py", line 1, in <module>
from tldextract import extract, urlsplit
File "/base/data/home/apps/tldextract/1.353646899194797005/tldextract/tldextract.py", line 30, in <module>
import pkg_resources