john-kurkowski / tldextract
Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
License: BSD 3-Clause "New" or "Revised" License
extract should strip trailing whitespace from the text values passed to it, e.g.:
>>> tldextract.extract('ads.adiquity.com ')
ExtractResult(subdomain='ads.adiquity', domain='com ', suffix='')
>>> tldextract.extract('ads.adiquity.com '.strip())
ExtractResult(subdomain='ads', domain='adiquity', suffix='com')
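The failure is easy to see in isolation: suffix membership checks are exact string matches, so a trailing space makes the last label unmatchable. A minimal stdlib-only illustration (the one-entry suffix set is mine, standing in for the PSL):

```python
# Illustrative only: a one-entry stand-in for the Public Suffix List.
SUFFIXES = {'com'}

def last_label(hostname):
    # Split on dots; the final label is what gets tested against the PSL.
    return hostname.split('.')[-1]

# 'com ' (with a trailing space) is not equal to 'com', so the lookup misses.
assert last_label('ads.adiquity.com ') not in SUFFIXES
# Stripping the input first restores the expected match.
assert last_label('ads.adiquity.com '.strip()) in SUFFIXES
```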
Hi,
thanks for the lib, works great!
but I want to use it on my GAE server.
Is there a nice way to load the TLD cache file other than by reading a file, since there is no real path to it?
Thanks!
In [2]: tldextract.extract(u'1\xe9')
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-2-11b8ca5967f6> in <module>()
----> 1 tldextract.extract(u'1\xe9')
/home/chungwu/.virtualenvs/pod/local/lib/python2.7/site-packages/tldextract/tldextract.pyc in extract(url)
291 @wraps(TLD_EXTRACTOR.__call__)
292 def extract(url):
--> 293 return TLD_EXTRACTOR(url)
294
295 @wraps(TLD_EXTRACTOR.update)
/home/chungwu/.virtualenvs/pod/local/lib/python2.7/site-packages/tldextract/tldextract.pyc in __call__(self, url)
216 if not tld and netloc and netloc[0].isdigit():
217 try:
--> 218 is_ip = socket.inet_aton(netloc)
219 return ExtractResult('', netloc, '')
220 except AttributeError:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
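One workaround (a sketch of a caller-side pre-processing step, not the library's own fix; the helper name is mine) is to IDNA-encode the hostname before it reaches tldextract, so only ASCII ever hits socket.inet_aton:

```python
def to_ascii_hostname(hostname):
    # The stdlib 'idna' codec converts each Unicode label to its
    # ASCII 'xn--' form (RFC 3490), label by label.
    return hostname.encode('idna').decode('ascii')

# A hostname with an accented character becomes pure ASCII:
# to_ascii_hostname('café.com') -> 'xn--caf-dma.com'
```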
Since Python 3, all strings are unicode, so the u'...' notation no longer makes sense.
One can't install your package on a Python 3 installation because of the return statement of this function:
def fetch_file(urls):
    """Decode the first successfully fetched URL, from UTF-8 encoding to
    Python unicode.
    """
    s = ''
    for url in urls:
        try:
            conn = urlopen(url)
            s = conn.read()
        except Exception as e:
            LOG.error('Exception reading Public Suffix List url ' + url + ' - ' + str(e) + '.')
        else:
            return _decode_utf8(s)
    LOG.error('No Public Suffix List found. Consider using a mirror or constructing your TLDExtract with `fetch=False`.')
    return u''
This module doesn't work for me today since PUBLIC_SUFFIX_LIST_URL is not accessible.
https://raw.github.com/mozilla/mozilla-central/master/netwerk/dns/effective_tld_names.dat
Here at Veracode we've made use of this wonderful library for quite some time now. It's great.
We'd like to be able to control the Public Suffix List input data more tightly. Namely, we don't just want to set an expiration, but to actually version-control the input data ourselves, and be able to specify a different version of the Public Suffix List file ourselves, at runtime.
This could be a keyword argument like suffix_list_file
. The value would be the path to a file exactly like the file that tldextract downloads. In other words, any plain-text file that is UTF-8 encoded. It should be newline-separated and conform to the syntax specified on the Mozilla Public Suffix List website, of course, but that's really up to the client to be correct about.
I am not 100% sure of what the relationship of this file would be to the cache_file
and fetch
arguments. It seems like the simplest way would be for this file to be a drop-in replacement for the downloaded Suffix List file. (i.e. read in this file if cache_file
does not exist.) TBH, I'm not sure how to force TLDExtract to load in the data anew regardless of whether cache_file
exists, at this point, so I don't know how this new feature would relate to that behavior.
I will probably fork and implement this.
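A sketch of what the proposed suffix_list_file could amount to once read (the keyword is the issue's proposal and the helper name is mine, not the library's API): parse a UTF-8, newline-separated, PSL-syntax file into suffix rules, skipping comments and blank lines.

```python
def load_suffix_list(path):
    # PSL files are UTF-8 text; lines starting with '//' are comments.
    with open(path, encoding='utf-8') as f:
        return [line.strip() for line in f
                if line.strip() and not line.lstrip().startswith('//')]
```

Whether this replaces or merely seeds the cache_file would still need deciding, as discussed above.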
>>> print tldextract.extract(' http://example.com/foo?bla=1#baz ')
ExtractResult(subdomain='', domain=' http', suffix='')
What am I doing wrong?
https://github.com/john-kurkowski/tldextract/blob/master/tldextract/tldextract.py#L344
the pkg_resources shim does not define get_distribution(), so tldextract.main() will fail if setuptools is not installed.
https://github.com/john-kurkowski/tldextract/blob/master/tldextract/tldextract.py#L44
Line 174:
print >> sys.stderr, line
uses the sys module, which is never imported.
Love tldextract, thank you. But I just wanted to suggest something... tld in the ExtractResult is actually representative of a suffix, not a TLD, as (a) this data comes from the Public Suffix List, and (b) assuming extract is an instance of TLDExtract, extract('foo.co.uk').tld == "co.uk". In that case "uk" is actually the TLD, a country-code TLD to be precise, and "co" is just conventional... So yes, it's good for them to be part of the same information, as we don't want to be told that the domain is "co".
I think this misleading term could easily be resolved by changing the name of the attribute to suffix. Then just make a property that returns it (and doesn't get printed in repr), called tld, for backwards compatibility.
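The suggestion above can be sketched with a plain namedtuple subclass (this is the proposal's shape, not the library's actual code): the stored field becomes suffix, and tld survives as a property, which namedtuple's repr will not print.

```python
from collections import namedtuple

class ExtractResult(namedtuple('ExtractResult',
                               ['subdomain', 'domain', 'suffix'])):
    __slots__ = ()

    @property
    def tld(self):
        # Backwards-compatible alias for the renamed field.
        return self.suffix
```

Because tld is a property rather than a field, repr(ExtractResult('foo', 'example', 'co.uk')) still shows only the three stored fields.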
tld.extract('http://pajamasam.blogspot.com')
=> ExtractResult(subdomain='', domain='pajamasam', suffix='blogspot.com')
tld.extract('http://pajamasam.blogspot.ru')
=> ExtractResult(subdomain='pajamasam', domain='blogspot', suffix='ru')
File "/www/sites/virtualenv/lib/python3.4/site-packages/tldextract/tldextract.py", line 267, in _get_tld_extractor
self._cache_tlds(tlds)
File "/www/sites/virtualenv/lib/python3.4/site-packages/tldextract/tldextract.py", line 308, in _cache_tlds
LOG.info("computed TLDs: [%s, ...]", ', '.join(list(tlds)[:10]))
This throws a bunch of UnicodeEncodeErrors on the latest pip version. Any idea what that is all about?
Thanks!
This is a non-blocking error that I can capture only if I import the logging module.
manuel@Manuel-NG:~>python
Python 2.7.2 (default, Oct 4 2011, 14:55:10)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
...
import tldextract
import logging
logging.basicConfig()
one, two, three = tldextract.extract('forums.bbc.co.uk/nano/micro.html')
ERROR:/Library/Python/2.7/site-packages/tldextract-1.0-py2.7.egg/tldextract/tldextract.pyc:error reading TLD cache file /Library/Python/2.7/site-packages/tldextract-1.0-py2.7.egg/tldextract/.tld_set: [Errno 2] No such file or directory: '/Library/Python/2.7/site-packages/tldextract-1.0-py2.7.egg/tldextract/.tld_set'
WARNING:/Library/Python/2.7/site-packages/tldextract-1.0-py2.7.egg/tldextract/tldextract.pyc:unable to cache TLDs in file /Library/Python/2.7/site-packages/tldextract-1.0-py2.7.egg/tldextract/.tld_set: [Errno 13] Permission denied: '/Library/Python/2.7/site-packages/tldextract-1.0-py2.7.egg/tldextract/.tld_set'
touch'ed and chown'ed .tld_set with no success.
chown'ed the folder, and everything worked fine. Something to review in the install process under OS X? (mine is Lion 10.7.3)
Great library! Exactly what I was looking for.
One small request: I'd like to have a little more control over the caching of the public suffix list. We run jobs across a lot of machines, so each machine will get its own cached version of the list. I'd rather not have to manage the cached list on each machine. I think adding a cache expiration parameter would be a huge help with this.
Also, allowing the user to specify the URL for the public suffix list would be great, too. I'd like to host our own cache of the list, so we're not dependent on mozilla's uptime and bandwidth.
I think these two changes would add a lot of flexibility.
If you'd prefer, I can fork a branch and send a pull request with the changes.
The CSIRO domain is often misunderstood, as it is the only domain name issued to a specific organisation directly within the .au ccTLD (i.e. csiro.au), instead of via some other second-level suffix (e.g. csiro.org.au).
The doctests below should pass, but don't.
#!/usr/bin/env python
"""
The csiro.au domain is a domain in its own right.

>>> extract('csiro.au')
ExtractResult(subdomain='', domain='csiro', tld='au')
>>> extract('www.csiro.au')
ExtractResult(subdomain='www', domain='csiro', tld='au')
>>> extract('research.ict.csiro.au')
ExtractResult(subdomain='research.ict', domain='csiro', tld='au')
"""
from tldextract import extract

if __name__ == "__main__":
    import doctest
    doctest.testmod()
The error is
error reading TLD cache file /usr/home/jaime/venv-qt4-p34dm/lib/python3.4/site-packages/tldextract/.tld_set: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
after doing the first 'extract' in a session.
It's fixed by opening self.cache_file (in _get_tld_extractor) in mode 'rb':
try:
    with open(self.cache_file, 'rb') as f:
        self._extractor = _PublicSuffixListTLDExtractor(pickle.load(f))
I'm running python 3.4 on a virtual environment, as can be deduced from the path.
Could it please be modified, so calling 'extract' on a fresh 'pip install' setup doesn't complain? I'm planning to use the package in a project. Tell me if more information is required, or if a pull request is needed (this looks trivial, but I may be wrong)
The import statement in line 35 of tldextract.py causes an unsuppressed UserWarning in Python 2.6.
An internal server error (500) status is returned when visiting the example URL from the readme file:
http://tldextract.appspot.com/api/extract?url=http://www.bbc.co.uk/foo/bar/baz.html
Would it be possible to add a "fix tld" type feature? I have a list of emails, some of which are not complete, such as "[email protected]" instead of "[email protected]"; it would be awesome to figure out a way to recognize state.tx as partially complete.
I know this might be tricky, since state.tx isn't actually a TLD. Any thoughts?
Hello,
I'm using Python 2.7.4 and get this error:
import tldextract
tldextract.extract('http://forums.news.cnn.com/')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/andres/prueba/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 239, in extract
return TLD_EXTRACTOR(url)
File "/home/andres/prueba/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 167, in __call__
registered_domain, tld = self._get_tld_extractor().extract(netloc)
File "/home/andres/prueba/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 207, in _get_tld_extractor
tlds = frozenset(tld for tld_source in tld_sources for tld in tld_source())
File "/home/andres/prueba/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 207, in <genexpr>
tlds = frozenset(tld for tld_source in tld_sources for tld in tld_source())
File "/home/andres/prueba/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 253, in _PublicSuffixListSource
page = _fetch_page('http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1')
File "/home/andres/prueba/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 247, in _fetch_page
return unicode(urlopen(url).read(), 'utf-8')
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1187, in do_open
r = h.getresponse(buffering=True)
File "/usr/lib/python2.7/httplib.py", line 1045, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 373, in _read_status
raise BadStatusLine(line)
httplib.BadStatusLine: ''
An example:
extract("HTTP://NYTIMES.COM/")
--> ExtractResult(subdomain='NYTIMES', domain='COM', tld='')
extract("http://nytimes.com/")
--> ExtractResult(subdomain='', domain='nytimes', tld='com')
Correct me if I'm wrong, but it seems that uk.com and us.com are top-level domains as well?
I often find myself doing '.'.join(tldextract.extract(url)[1:]), which seems kind of hard to remember for something that seems like it would be widely used. Basically an alias to return domain + tld separated by a dot.
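The requested alias could be as small as this (the helper name is hypothetical; it operates on any (subdomain, domain, suffix)-style triple such as extract's result):

```python
def registered_domain(parts):
    # Join domain and suffix with a dot, skipping empty pieces so a
    # suffix-less result like ('', 'localhost', '') stays clean.
    _subdomain, domain, suffix = parts
    return '.'.join(p for p in (domain, suffix) if p)
```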
It seems that TLDExtract does not lower() its inputs before testing. So if you input foo.CoM, it's not going to extract the suffix com. Case should be lowered because the Public Suffix List is all lowercase, and because DNS is case-insensitive (RFC 4343).
>>> from veracode.utils.data.formatter import extract
>>> extract('example.COM')
ExtractResult(subdomain='example', domain='COM', tld='')
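The reason lowering matters is that suffix lookup is an exact match against a lowercase list; a minimal stand-alone sketch (the two-entry suffix set is illustrative, not the real PSL):

```python
SUFFIXES = {'com', 'co.uk'}

def longest_suffix(hostname):
    # Lowercase first, then test progressively shorter label tails
    # against the (lowercase) suffix set; the first hit is the longest.
    labels = hostname.lower().split('.')
    for i in range(len(labels)):
        candidate = '.'.join(labels[i:])
        if candidate in SUFFIXES:
            return candidate
    return ''
```

Without the .lower() call, 'foo.CoM' would match nothing, which is exactly the reported behavior.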
Can you add support for punycode encoded TLDs?
>>> tldextract.extract("xn--80aaaaimf2bemuibxoc5d.xn--p1ai")
ExtractResult(subdomain='xn--80aaaaimf2bemuibxoc5d', domain='xn--p1ai', suffix='')
Should be
>>> tldextract.extract("xn--80aaaaimf2bemuibxoc5d.xn--p1ai")
ExtractResult(subdomain='', domain='xn--80aaaaimf2bemuibxoc5d', suffix='xn--p1ai')
https://www.mozilla.org/en-US/about/governance/policies/security-group/tld-idn/
If I run tldextract --update to update the domains file then, because of this line https://github.com/john-kurkowski/tldextract/blob/1.5.1/tldextract/tldextract.py#L264, the private domains are not downloaded.
Therefore, if I later create a TLDExtract(include_psl_private_domains=True) instance, the private domains are not mapped.
I recommend always downloading the full list and replacing line 264 with:
TLD_EXTRACTOR = TLDExtract(include_psl_private_domains=True)
When using tldextract on Python 3.4 I noticed that the __dict__ attribute of the namedtuple ExtractResult would always be empty, which caused its _asdict method to also return an empty dict. This is because ExtractResult does not define a __slots__ member, which prevents the parent class's __dict__ from being added. I am not sure which version of Python changed this behavior. I added a unit test (and also added py34 to the tox.ini environments) for this issue and it passed for Python 2.7 but failed for Python 3.4; those are the only versions I had installed. I fixed the issue in tldextract.py and all tests now pass, in both 2.7 and 3.4. I created pull request #69 with the fix and additional test.
I have a GAE app instance and use the tldextract library. Everything works on my localhost, however when i push to the GAE app everything breaks and it complains that line: 311 of 'tldextract.py' cannot find file '.tld_set_snapshot'
snapshot_stream = pkg_resources.resource_stream(name, '.tld_set_snapshot')
The tldextract exists in /external/ and the tld_set_snapshot file is resides in \external\tldextract\
This is the stacktrace below:
File "/base/data/home/apps/s~cmat---/1.389467096411176514/lib/external/tldextract/tldextract.py", line 343, in extract
  return TLD_EXTRACTOR(url)
File "/base/data/home/apps/s~cmat---/1.389467096411176514/lib/external/tldextract/tldextract.py", line 214, in __call__
  suffix_index = self._get_tld_extractor().suffix_index(translations)
File "/base/data/home/apps/s~cmat---/1.389467096411176514/lib/external/tldextract/tldextract.py", line 267, in _get_tld_extractor
  self._cache_tlds(tlds)
File "/base/data/home/apps/s~cmat---/1.389467096411176514/lib/external/tldextract/tldextract.py", line 311, in _cache_tlds
  snapshot_stream = pkg_resources.resource_stream(name, '.tld_set_snapshot')
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/setuptools-0.6c11/pkg_resources.py", line 888, in resource_stream
  self, resource_name
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/setuptools-0.6c11/pkg_resources.py", line 1285, in get_resource_stream
  return open(self._fn(self.module_path, resource_name), 'rb')
IOError: [Errno 2] No such file or directory: '/base/data/home/apps/s~cmat---/1.389467096411176514/lib/external/tldextract/.tld_set_snapshot'
I couldn't properly reproduce the error locally, so I don't know exactly under which circumstances this happens, but you can check the stacktrace here.
I've solved this issue by encoding the line in UTF-8 before printing it. I'll submit a pull request with that change.
I created a new virtualenv, installed tldextract 0.3 with pip, ran our test suite, and received the following stack trace:
File "/Envs/myenv/lib/python2.6/site-packages/tldextract/tldextract.py", line 96, in extract
  return _extract(netloc, fetch)
File "/Envs/myenv/lib/python2.6/site-packages/tldextract/tldextract.py", line 114, in _extract
  registered_domain, tld = _get_tld_extractor(fetch).extract(netloc)
File "/Envs/myenv/lib/python2.6/site-packages/tldextract/tldextract.py", line 158, in _get_tld_extractor
  with open(snapshot_file) as f:
IOError: [Errno 2] No such file or directory: '/Envs/myenv/lib/python2.6/site-packages/tldextract/.tld_set_snapshot'
The issue is fixed when I run python -m tldextract.tldextract http://forums.bbc.co.uk. This is a problem for us because we are unable to run such a command on each machine after deploying to our cluster.
Repro
TLDExtract(include_psl_private_domains=True).update()
TLDExtract(include_psl_private_domains=False).extract('foo.appspot.com')
Expected: foo appspot com
Actual: foo.appspot com
The opposite order is also broken.
Related to #64.
ext = tldextract.extract('http://www.google.com#dfdfdf')
ext
ExtractResult(subdomain='www.google', domain='com#dfdfdf', tld='')
For some reason parsing a subhost on blogspot results in a messed-up parse. The domain bit returns the subdomain, whereas the subdomain returns an empty string.
>>> ext = tldextract.extract("http://dadada.blogspot.com")
>>> print ext.domain
dadada
>>> print ext.subdomain

Basically a request to also include the path on extract?
Websites such as:
http://www.tfl.gov.uk
http://www.nhsdirect.nhs.uk
This leads to them being separated incorrectly in some circumstances.
Hi,
it seems that stderr debug logging is enabled by default? I have a scrapy project that uses tldextract once, and upon import it spams the log with 2000 debug lines, starting with:
[tldextract] computed TLDs: [chikuho.fukuoka.jp, భారత్, 公司.hk, pvt.ge, matsuzaki.shizuoka.jp, name.eg, tsuruga.fukui.jp, in-addr.arpa, hisayama.fukuoka.jp, name.et, ...]
[stderr] --- .tld_set_snapshot
How I fixed it:
# disable tldextract stderr spam
logger = logging.getLogger('tldextract')
logger.setLevel('WARNING')
If this is not enabled by default, could it be that tldextract grabs some other logger's level by default? I'm not too familiar with the logging module, so I'm not sure of the cause; however, spamming stderr upon import is something that shouldn't happen.
.tld_set_snapshot and .tld_set use pickle to store the TLD information. While this is perfectly fine in most cases, it brings the following issues:
I am having problems with
https://github.com/john-kurkowski/tldextract/blob/master/tldextract/tldextract.py#L182
and doing
## We need to monkey patch in the new Python IO
## because of http://www.freebsd.org/cgi/query-pr.cgi?pr=148581
## when using the kqueue reactor and twistd.
##
import io, __builtin__
__builtin__.open = io.open
File "c:\python27\lib\site-packages\tldextract-1.1-py2.7.egg\tldextract\tldextract.py", line 187, in _get_tld_extractor
pickle.dump(tlds, f)
exceptions.TypeError: must be unicode, not str
The issue can be solved by opening the file in binary mode:
with open(cached_file, 'wb') as f:
There are multiple uses of the Mozilla public suffix list which allow sites such as "appspot.com" to appear on the list as a tld instead of being split into domain="appspot" and tld="com".
This is perfectly reasonable behavior for some use cases, but for others it would be helpful to have the "private" domains be excluded. Mozilla has split the list into "ICANN Domains" and "Private Domains", and it would be useful to optionally be able to exclude the private domains so that sites like "appspot.com" would have their tld reflected as "com".
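Mechanically, the split is visible in the PSL file itself, which brackets the private section with marker comments; a sketch of filtering that section out (the helper name is mine, and the sample lines in the test are abbreviated):

```python
def icann_suffixes(psl_lines):
    # The real PSL delimits its private section with these exact markers,
    # themselves written as '//' comment lines.
    suffixes, in_private = [], False
    for line in psl_lines:
        if '===BEGIN PRIVATE DOMAINS===' in line:
            in_private = True
        elif '===END PRIVATE DOMAINS===' in line:
            in_private = False
        elif line.strip() and not line.startswith('//') and not in_private:
            suffixes.append(line.strip())
    return suffixes
```

With the private section dropped, "appspot.com" no longer appears as a suffix, so its tld would be reported as "com".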
It looks like Mozilla may be blocking HTTP access to the effective_tld_names.dat file:
http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
It's been down for at least 24 hours, but the rest of mxr.mozilla.org is alive which makes me think the URL is deliberately blocked.
Here is a GitHub mirror of the file:
https://raw.github.com/mozilla/mozilla-central/master/netwerk/dns/effective_tld_names.dat
In [1]: import tldextract
In [2]: url='twitter.github.io'
In [3]: tldextract.extract(url)
Out[3]: ExtractResult(subdomain='', domain='twitter', suffix='github.io')
I'll try to create a pull request :)
Given a URL like:
xx.yyy.domain.com/path_me_somewhere.html
it would be nice if tldextract could also return the path (and other components) of the url.
for example, instead of this code
url_parts = urlparse.urlparse(some_funny_url)
url = tldextract.extract("%s://%s" % (url_parts.scheme, url_parts.netloc))
url = "%s://%s.%s%s" % (url_parts.scheme, url.domain, url.tld, url_parts.path)
it could be as simple as this:
url = tldextract.extract(some_funny_url)
url = "%s://%s.%s%s" % (url.scheme, url.domain, url.tld, url.path)
what do you think?
The PyPI trove classifiers seem to indicate that this package is BSD-licensed, but there's no explicit license statement in the source. It would make it easier for me to use tldextract if the licensing were made clearer.
The argument "suffix_list_url" seems to have been replaced with "fetch".
Thanks!
Hey,
The regex used by tldextract fails to detect ".com.au" properly, amongst many others on this page:
https://wiki.mozilla.org/TLD_List
Might be worth updating to match this list? I would do it, but I don't have any unit tests. :(
Cal
for example:
"xn--h1alffa9f.xn--p1ai" results in
ExtractResult(subdomain='xn--h1alffa9f', domain='xn--p1ai', suffix='')
Workaround:
def tldextract_with_puny(url):
    if '.xn--' in url:  # punycode domains
        o = urlparse.urlparse(url)
        url = o.netloc.decode('idna')
    ext = tldextract.extract(url)
    return ext
Empty paths are not handled correctly:
tldextract.extract('www.whatever.com?param=1')
ExtractResult(subdomain='www.whatever', domain='com?param=1', tld='')
That should give the same result as:
tldextract.extract('www.whatever.com/?param=1')
Regards,
Stephane.
>>> tldextract.extract("http://www.blogspot.de/")
ExtractResult(subdomain='', domain='www', tld='blogspot.de')
Any chance you'll add more control over list file caching, e.g. passing in a PSL cache file to a constructor, or specifying another location to store the cache? If the user running a script using tldextract doesn't have write access to /path/to/tldextract/.tld_set and it doesn't exist as yet, this can cause a problem.
Additionally, any thoughts about being able to specify a time interval to keep the cache, and refresh if it's older than that? I like the caching flexibility in publicsuffix (http://pypi.python.org/pypi/publicsuffix/1.0.0), but I prefer the way you do matching and return a named tuple.
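The expiry part of the request could be as simple as an mtime check before trusting the cached list (the function and parameter names here are hypothetical, not the library's API):

```python
import os
import time

def cache_is_fresh(cache_path, max_age_seconds=7 * 24 * 3600):
    # Stale and missing caches both signal that a re-fetch is needed.
    try:
        age = time.time() - os.path.getmtime(cache_path)
    except OSError:
        return False  # cache file missing
    return age < max_age_seconds
```

The constructor would then re-download the suffix list whenever cache_is_fresh returns False, instead of only when the file is absent.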
pkg_resources
isn't available on GAE
<type 'exceptions.ImportError'>: No module named pkg_resources
Traceback (most recent call last):
File "/base/data/home/apps/tldextract/1.353646899194797005/handlers.py", line 2, in <module>
import tldextract
File "/base/data/home/apps/tldextract/1.353646899194797005/tldextract/__init__.py", line 1, in <module>
from tldextract import extract, urlsplit
File "/base/data/home/apps/tldextract/1.353646899194797005/tldextract/tldextract.py", line 30, in <module>
import pkg_resources