layershifter / tldextract
[DEPRECATED] Library for extraction of domain parts e.g. TLD. Domain parser that uses Public Suffix List
License: Apache License 2.0
The package needs full IDN support; right now it has problems with punycoded domain suffixes.
I'm running some domain names through TLDExtract
and came across a domain not being properly parsed.
The URL is blogspot.com:
$url = 'blogspot.com';
$domain = tld_extract($url);
var_dump($domain);
Returns:
object(LayerShifter\TLDExtract\Result)[9]
private 'subdomain' => null
private 'hostname' => string 'blogspot.com' (length=12)
private 'suffix' => null
Weirdly the URL 'flogspot.com' works fine and returns:
object(LayerShifter\TLDExtract\Result)[9]
private 'subdomain' => null
private 'hostname' => string 'flogspot' (length=8)
private 'suffix' => string 'com' (length=3)
The URL logspot.com also works and returns:
object(LayerShifter\TLDExtract\Result)[9]
private 'subdomain' => null
private 'hostname' => string 'logspot' (length=7)
private 'suffix' => string 'com' (length=3)
Any idea why the TLD in 'blogspot.com' is not being added to the suffix? Is this a bug?
For "http://example.com?foo=bar", getRegistrableDomain() returns "example.com?foo=bar" rather than "example.com".
My temporary quick fix in my code is to replace all "?" characters with "/" before passing the URL to TLDExtract.
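An alternative to rewriting "?" characters might be to pull the host out with PHP's built-in parse_url() first and hand only that to the parser. This is just a sketch: hostFromUrl() is a hypothetical helper, not part of TLDExtract, and it assumes the input carries a scheme such as http://.

```php
<?php
/**
 * Returns the host part of a URL, or null if parse_url() finds none.
 * Hypothetical helper for illustration; not part of TLDExtract.
 */
function hostFromUrl(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);

    // parse_url() returns null or false when there is no usable host.
    return is_string($host) && $host !== '' ? $host : null;
}

// The query string is dropped before the library ever sees it.
echo hostFromUrl('http://example.com?foo=bar'), "\n"; // example.com
```

The returned host could then be passed to parse() in place of the full URL.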
The parser fails for the following:
// If the subdomain has "-"
$url = 'https://s3-ap-southeast-2.amazonaws.com/blabla/blabla/wp-content/uploads/media/2019/03/16860571424_31c94205de_b.jpg';
// Extract domain parts
$extract = new \LayerShifter\TLDExtract\Extract();
$domainParser = $extract->parse($url);
parse_url($url, PHP_URL_HOST); // s3-ap-southeast-2.amazonaws.com
$domainParser->getSubdomain(); // null
Some reading that may help to improve or fix this, if needed:
https://en.wikipedia.org/wiki/Domain_name
which introduces
https://tools.ietf.org/html/rfc1034
https://tools.ietf.org/html/rfc1035
Domain names are used in URLs:
https://en.wikipedia.org/wiki/Url
which introduces
https://url.spec.whatwg.org/
This should make it possible to close #45.
$extract = new Extract();
$result = $extract->parse("test@[email protected]")->isValidDomain();
This results in true, when it should be false ("@" is not allowed in domain names).
Hello,
Would it be possible to add an exception for this special name, which is used in the very popular Symfony framework?
Bye
Hello,
Extracting from some subdomains is not working as expected.
For example, extracting from:
I get things like:
object(LayerShifter\TLDExtract\Result)[6]
private 'subdomain' => null
private 'hostname' => string 'some-long-string' (length=16)
private 'suffix' => string 'us-west-2.elb.amazonaws.com' (length=27)
object(LayerShifter\TLDExtract\Result)[5]
private 'subdomain' => null
private 'hostname' => string 'whatever' (length=8)
private 'suffix' => string 'cloudfront.net' (length=14)
object(LayerShifter\TLDExtract\Result)[7]
private 'subdomain' => null
private 'hostname' => string 'whatever' (length=8)
private 'suffix' => string 'googleapis.com' (length=14)
Hi, I'm getting the error:
Stack trace:
Message: Uncaught OutOfRangeException: Unknown field "errors" in xxx/vendor/layershifter/tld-extract/src/Result.php:220
File: xxx/vendor/layershifter/tld-extract/src/Result.php
Line: 220
thrown
#0 LayerShifter\TLDExtract\Result->__get('errors')
My code is straight from the example:
$extract = new LayerShifter\TLDExtract\Extract();
$result = $extract->parse('shop.github.com');
$result->getRegistrableDomain();
Composer install details:
Using version ^2.0 for layershifter/tld-extract
./composer.json has been updated
Loading composer repositories with package information
Updating dependencies (including require-dev)
Package operations: 5 installs, 0 updates, 0 removals
- Installing symfony/polyfill-php72 (v1.10.0): Downloading (100%)
- Installing symfony/polyfill-intl-idn (v1.10.0): Downloading (100%)
- Installing layershifter/tld-support (1.1.1): Downloading (100%)
- Installing layershifter/tld-database (1.0.65): Downloading (100%)
- Installing layershifter/tld-extract (2.0.1): Downloading (100%)
Am I missing something or doing something wrong?
Many thanks,
jkns
Hello,
Can you please change the guzzlehttp/guzzle package requirement in your composer.json for version 0.2.0?
guzzlehttp/guzzle version 6.* doesn't support PHP 5.4.0, but version 5.* does.
So can you change it to "guzzlehttp/guzzle": "^5.3.1" and update the tag version to 0.2.1?
Thanks for your help
Hi,
I've found some weird behavior in domain extraction:
$extract = new Extract();
$result = $extract->parse('dkim._domainkey.phea.fr');
print_r($result->toArray());
$result = $extract->parse('dkim.domainkey.phea.fr');
print_r($result->toArray());
Result:
Array
(
[subdomain] => dkim._domainkey.phea
[hostname] => fr
[suffix] =>
)
Array
(
[subdomain] => dkim.domainkey
[hostname] => phea
[suffix] => fr
)
The problem comes from the _ character.
This regex fixes the problem:
const HOSTNAME_PATTERN = '#^((?!-)[a-z0-9_-]{0,62}[a-z0-9]\.)+[a-z]{2,63}|[xn\-\-a-z0-9]]{6,63}$#';
Currently the suffixExists method on the Extract class is private. It would be useful if it were protected, so you could override the default behavior and provide an alternate data source.
As part of some validation requirements, I wanted to loop over subdomain labels, and since an array was the expected return of the getSubdomains() function, I assumed it would return an empty array if none were found.
It turns out it returns null, which broke any attempt to use it as the expected return type. I ended up working around this with the null coalescing operator, like the following:
$subdomainLabels = $hostnameExtract->getSubdomains() ?? array();
So, should this function return an empty array by default? I'd personally prefer if it did, since it keeps the return type consistent - but others might have differing opinions.
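To illustrate the "always an array" behavior being asked for, here is a minimal sketch. subdomainLabels() is a hypothetical helper that takes the subdomain as a nullable string (for example, whatever getSubdomain() returned); it is not how the library implements getSubdomains().

```php
<?php
/**
 * Splits a subdomain string into its labels.
 * Always returns an array, so callers can loop without a null check.
 * Hypothetical helper for illustration; not part of TLDExtract.
 */
function subdomainLabels(?string $subdomain): array
{
    if ($subdomain === null || $subdomain === '') {
        return [];
    }

    return explode('.', $subdomain);
}

foreach (subdomainLabels(null) as $label) {
    echo $label, "\n"; // never reached: the array is empty
}

print_r(subdomainLabels('dkim.domainkey'));
```

With this shape the return type stays consistent, and the ?? array() workaround is no longer needed at call sites.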
$result = $extract->parse('test..com');
$result->isValidDomain(); // gives true
The registrable domain is .com in this case... I think this domain name should be invalid.
$url = 'http://exam-plewdgrfWEDRKGJHBSAFVHBJKSDAVJKHBDSVJKHBSDJKBVFJHBDSJHBSDJKBVASDJKHBASDJKHBFJKADBSVCJKHBASDCJHBJHASBJUHBEJDHBJASBDHABSDMNBASJKBHJWQHBDMNASBDJHEWQRJHBAJSHBMANSBDJHQWEDJHBASMNDBMASNDBMBgwqe5vqwerfvcqw4vtwergfkhabdsvjkhbqerjkhbajskdbfcjkahsgbecauhsdeclkjhslkjadhfjkasberflkjazdklcklaDSJASNDKLNAKSDJNFKLAJSDJNASKNF.com';
$extract = new LayerShifter\TLDExtract\Extract();
$result = $extract->parse($url);
$domain = $result->getRegistrableDomain();
echo $result->isValidDomain();
This returns TRUE when it should return FALSE.
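The validation reports above ("@" in the name, the empty label in test..com, and the over-long label) all come down to per-label rules. Below is a rough sketch of such a check, assuming the classic limits of 1 to 63 characters per label, 253 characters overall, and only letters, digits and hyphens. hasValidLabels() is a hypothetical name, not the library's isValidDomain(), and this strict rule would also reject the underscore names discussed in other issues.

```php
<?php
/**
 * Checks every dot-separated label of an ASCII host name.
 * Hypothetical helper for illustration; not part of TLDExtract.
 */
function hasValidLabels(string $host): bool
{
    if ($host === '' || strlen($host) > 253) {
        return false;
    }

    foreach (explode('.', $host) as $label) {
        // 1 to 63 characters from [a-z0-9-], with no leading or trailing hyphen.
        // An empty label (as in "test..com") fails the {1,63} quantifier.
        if (preg_match('/^(?!-)[a-z0-9-]{1,63}(?<!-)$/iD', $label) !== 1) {
            return false;
        }
    }

    return true;
}

var_dump(hasValidLabels('example.com'));                   // bool(true)
var_dump(hasValidLabels('test..com'));                     // bool(false)
var_dump(hasValidLabels('test@example.com'));              // bool(false)
var_dump(hasValidLabels(str_repeat('a', 64) . '.com'));    // bool(false)
```

Punycoded IDN labels (xn--...) pass this check, but raw Unicode names would need converting first.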
I have a loop that extracts registrable domains. Most of them run fine, but I still get a lot of empty responses. What could be the reason that it does not find a registrable domain for the following examples:
mail-sor-f69.google.com (you would expect google.com)
static.vnpt.vn (you would expect vnpt.vn)
mx1.sub5.homie.mail.dreamhost.com (you would expect dreamhost.com)
And there are many, many more.
Symfony now provides an official polyfill that it maintains itself, and the punycode GitHub page also describes it as an actively maintained alternative: https://github.com/symfony/polyfill-intl-idn
I use this library inside a Symfony project where intl is not required because Symfony provides a polyfill for it.
Currently it's not possible to use this library without intl installed (Composer will error). I would recommend using the Symfony intl polyfill https://github.com/symfony/polyfill-intl-icu instead of requiring ext-intl.
"symfony/polyfill-intl-icu": "^1.0",
Hello
There is a bug in the library where it mishandles the backslash character.
For example, https://1337.karimrahal.com\.bla.com would give bla.com as the registered domain, whereas the browser would lead to the karimrahal.com domain.
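For what it's worth, browsers follow the WHATWG URL Standard, which treats "\" like "/" in http(s) URLs, so the host ends at the backslash. A possible pre-processing step, as a sketch only: hostLikeBrowser() is a hypothetical helper, it assumes an http(s) URL, and it does not cover the rest of the browser parsing rules.

```php
<?php
/**
 * Extracts the host roughly the way a browser would for http(s) URLs,
 * by treating backslashes as forward slashes before parsing.
 * Hypothetical helper for illustration; not part of TLDExtract.
 */
function hostLikeBrowser(string $url): ?string
{
    $normalized = str_replace('\\', '/', $url);
    $host = parse_url($normalized, PHP_URL_HOST);

    return is_string($host) && $host !== '' ? $host : null;
}

echo hostLikeBrowser('https://1337.karimrahal.com\.bla.com'), "\n";
// 1337.karimrahal.com
```

Passing that host to the parser should then point at karimrahal.com rather than bla.com.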
It seems that IPv6 addresses are not recognized properly:
>>> $w = new LayerShifter\TLDExtract\Extract()
=> LayerShifter\TLDExtract\Extract {#197}
>>> $res = $w->parse('2bf2:eaa0::7:314:5474')
=> LayerShifter\TLDExtract\Result {#205}
>>> $res->isIp()
=> false
>>> $res->isValidDomain()
=> false
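As a stopgap, PHP's own filter_var() can recognize both IPv4 and IPv6 addresses before the value reaches the library. A small sketch, where isIpAddress() is a hypothetical wrapper and the bracket stripping is an assumption about inputs taken from URLs:

```php
<?php
/**
 * Returns true for IPv4 and IPv6 addresses, including bracketed
 * IPv6 literals as they appear in URLs (e.g. "[::1]").
 * Hypothetical helper for illustration; not part of TLDExtract.
 */
function isIpAddress(string $host): bool
{
    if (strlen($host) > 2 && $host[0] === '[' && substr($host, -1) === ']') {
        $host = substr($host, 1, -1);
    }

    // Without flags, FILTER_VALIDATE_IP accepts both address families.
    return filter_var($host, FILTER_VALIDATE_IP) !== false;
}

var_dump(isIpAddress('2bf2:eaa0::7:314:5474')); // bool(true)
var_dump(isIpAddress('example.com'));           // bool(false)
```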
How would I use this library without composer, which file do I need to include?
Thank you
Result {#217 ▼
-subdomain: null
-hostname: "youtube"
-suffix: "comxyzh"
}
I have tested with hundreds of domains, but for the domain test.ru I get the following result:
Result {#1711 ▼
-subdomain: null
-hostname: "test.ru"
-suffix: null
}
I'm not able to find the .tld_set database in the "/vendor/layershifter/tld_extract/cache" folder.
Is there a way to download it?
Thanks for this extension.
I get this error:
Cannot put TLD list to cache file /xxxxx/vendor/layershifter/tld-extract/src/../cache/.tld_set, check writes rights on directory of file
So I would have to change the permissions of the cache directory inside vendor, which is not a good solution.
I suggest modifying the code to check whether the directory is writable with is_writable() and, if it is not, to try changing the permissions with chmod(). An even better solution would be to create the cache directory automatically with a mkdir('cache', 0777); command.
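Along the lines of that suggestion, here is a sketch of a check that tries to create the directory and otherwise falls back to the system temp directory instead of changing permissions inside vendor. resolveCacheDir() and the fallback choice are assumptions for illustration, not the library's behavior, and 0775 is used here rather than 0777.

```php
<?php
/**
 * Returns a directory that can actually be written to:
 * the preferred one if it exists (or can be created) and is writable,
 * otherwise the system temp directory.
 * Hypothetical helper for illustration; not part of TLDExtract.
 */
function resolveCacheDir(string $preferred): string
{
    if (!is_dir($preferred)) {
        // Suppress the warning and rely on the checks below instead.
        @mkdir($preferred, 0775, true);
    }

    if (is_dir($preferred) && is_writable($preferred)) {
        return $preferred;
    }

    return sys_get_temp_dir();
}

echo resolveCacheDir(__DIR__ . '/cache'), "\n";
```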
Seems related to #25
Parsing the hostname _sub.mydomain.com results in the following:
{
"subdomain": "_sub",
"hostname": "mydomain",
"suffix": "com"
}
Parsing the hostname sub_.mydomain.com results in this:
{
"subdomain": "sub_.mydomain",
"hostname": "com",
"suffix": null
}
Which isn't great if you have an otherwise valid TLD and domain, and you're trying to parse out subdomains separately.
Changing the following regex should allow this:
const HOSTNAME_PATTERN = '#^((?!-)[a-z0-9_-]{0,62}[a-z0-9_]\.)+[a-z]{2,63}|[xn\-\-a-z0-9]]{6,63}$#';
Ref.
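A quick way to try the suggested pattern against the host names from these reports is plain preg_match(). The pattern below is copied verbatim from the suggestion; the script only tests the regex on its own and says nothing about how the rest of the parser would behave with it.

```php
<?php
// Pattern copied verbatim from the suggestion above.
const PROPOSED_PATTERN = '#^((?!-)[a-z0-9_-]{0,62}[a-z0-9_]\.)+[a-z]{2,63}|[xn\-\-a-z0-9]]{6,63}$#';

$hosts = [
    '_sub.mydomain.com',
    'sub_.mydomain.com',
    'dkim._domainkey.phea.fr',
];

foreach ($hosts as $host) {
    $matched = preg_match(PROPOSED_PATTERN, $host) === 1;
    printf("%-26s %s\n", $host, $matched ? 'match' : 'no match');
}
```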
I need to investigate and fix the incorrect operation of the parser when processing these domains:
#test.com
test.com#test_test
This doesn't seem to work for any blogspot subdomain.
test.blogspot.com gives as registrable domain: test.blogspot.com
test.github.com gives as registrable domain: github.com
I don't understand how it would work that way?
I need to extract domain names so that example.blogspot.com becomes blogspot.com and so on.
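Some background that may explain this: the Public Suffix List has an ICANN section and a private section, and blogspot.com is listed as a private suffix, so with private suffixes enabled the registrable domain is the suffix plus one label. The sketch below shows the matching idea with a tiny hand-picked list. The suffix arrays and registrableDomain() are illustrative assumptions, not the library's data or code, and wildcard and exception rules are ignored.

```php
<?php
// Illustrative excerpt only: the real Public Suffix List is far larger.
const ICANN_SUFFIXES = ['com', 'il', 'co.il'];
const PRIVATE_SUFFIXES = ['blogspot.com'];

/**
 * Returns the longest matching suffix plus one label, or null when the
 * host is itself a suffix or matches nothing in the list.
 */
function registrableDomain(string $host, bool $allowPrivate): ?string
{
    $suffixes = $allowPrivate
        ? array_merge(ICANN_SUFFIXES, PRIVATE_SUFFIXES)
        : ICANN_SUFFIXES;
    $labels = explode('.', strtolower($host));

    // Start with the whole host so the longest suffix wins.
    for ($i = 0; $i < count($labels); $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (in_array($candidate, $suffixes, true)) {
            // $i === 0 means the whole host is a suffix: nothing is registrable.
            return $i === 0 ? null : implode('.', array_slice($labels, $i - 1));
        }
    }

    return null;
}

var_dump(registrableDomain('test.blogspot.com', true));  // "test.blogspot.com"
var_dump(registrableDomain('test.blogspot.com', false)); // "blogspot.com"
var_dump(registrableDomain('blogspot.com', true));       // NULL
```

So getting blogspot.com for example.blogspot.com would mean restricting the match to ICANN suffixes only.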
I noticed that some hostnames are probably not parsed correctly. See these examples:
profound.eu.org
website.us.org
activia.us.com
Using Extract::MODE_ALLOW_ICANN they are parsed like:
LayerShifter\TLDExtract\Result Object
(
[subdomain:LayerShifter\TLDExtract\Result:private] => activia
[hostname:LayerShifter\TLDExtract\Result:private] => us
[suffix:LayerShifter\TLDExtract\Result:private] => com
)
But instead I think they should be parsed like:
LayerShifter\TLDExtract\Result Object
(
[subdomain:LayerShifter\TLDExtract\Result:private] =>
[hostname:LayerShifter\TLDExtract\Result:private] => activia
[suffix:LayerShifter\TLDExtract\Result:private] => us.com
)
I think they should be handled the same way as example.blogspot.com, where example is the subdomain, blogspot is the hostname and com is the suffix (with MODE_ALLOW_ICANN, of course). Websites under *.us.org and *.us.com can be registered here:
http://us.org/
http://www.us.com/
What are your thoughts?
Hi,
My code is as follows:
$line = 'http://www.upfile.co.il/xxxxxxxx.html';
$extract = new LayerShifter\TLDExtract\Extract();
$domain = $extract->parse($line);
$domain->getRegistrableDomain();
I get www.upfile.co.il instead of upfile.co.il.
Any idea what is going wrong?