layershifter / tldextract

[DEPRECATED] Library for extraction of domain parts e.g. TLD. Domain parser that uses Public Suffix List

License: Apache License 2.0

PHP 90.98% Shell 4.03% Dockerfile 4.99%


tldextract's Introduction

DEPRECATED

Consider using https://github.com/jeremykendall/php-domain-parser as a maintained alternative.

TLDExtract

TLDExtract accurately separates the gTLD or ccTLD (generic or country code top-level domain) from the registered domain and subdomains of a URL; in other words, it is a domain parser. For example, say you want just the 'google' part of 'http://www.google.com'.

Everybody gets this wrong. Splitting on the '.' and taking the last two elements works only for simple domains such as .com. Consider parsing http://forums.bbc.co.uk: the naive splitting method will give you 'co' as the domain and 'uk' as the TLD, instead of 'bbc' and 'co.uk' respectively.
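To make the failure concrete, here is a minimal sketch of the naive approach. The function name naiveSplit is made up for this illustration and is not part of TLDExtract.

```php
<?php
// Naive approach: treat the last label as the TLD and the one before it as the domain.
// naiveSplit() is a hypothetical helper for illustration only.
function naiveSplit(string $host): array
{
    $labels = explode('.', $host);
    $tld = array_pop($labels);    // last label
    $domain = array_pop($labels); // second-to-last label

    return [
        'subdomain' => implode('.', $labels),
        'domain'    => $domain,
        'tld'       => $tld,
    ];
}

$simple = naiveSplit('www.google.com');   // domain 'google', tld 'com': looks right
$broken = naiveSplit('forums.bbc.co.uk'); // domain 'co', tld 'uk': wrong
echo $broken['domain'], ' / ', $broken['tld'], PHP_EOL; // prints "co / uk"
```

The second call shows why a list of known suffixes is needed: nothing in the string itself says that 'co.uk' belongs together.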

TLDExtract on the other hand knows what all gTLDs and ccTLDs look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.

$result = tld_extract('http://forums.news.cnn.com/');
var_dump($result);

object(LayerShifter\TLDExtract\Result)#34 (3) {
  ["subdomain":"LayerShifter\TLDExtract\Result":private]=>
  string(11) "forums.news"
  ["hostname":"LayerShifter\TLDExtract\Result":private]=>
  string(3) "cnn"
  ["suffix":"LayerShifter\TLDExtract\Result":private]=>
  string(3) "com"
}
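The general idea behind suffix-aware splitting can be sketched as a longest-match lookup. This is only an illustration under stated assumptions: the three-entry suffix set and the names SKETCH_SUFFIXES and splitBySuffix are invented here, the real Public Suffix List also has wildcard and exception rules that are ignored, and this is not necessarily how TLDExtract implements it.

```php
<?php
// Hypothetical, hand-written suffix set. The real Public Suffix List has
// thousands of entries plus wildcard and exception rules not handled here.
const SKETCH_SUFFIXES = ['com', 'uk', 'co.uk'];

function splitBySuffix(string $host, array $suffixes = SKETCH_SUFFIXES): array
{
    $labels = explode('.', strtolower($host));

    // Try the longest candidate suffix first, always leaving at least
    // one label in front of it to serve as the hostname.
    for ($i = 1; $i < count($labels); $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (in_array($candidate, $suffixes, true)) {
            $subdomain = implode('.', array_slice($labels, 0, $i - 1));

            return [
                'subdomain' => $subdomain === '' ? null : $subdomain,
                'hostname'  => $labels[$i - 1],
                'suffix'    => $candidate,
            ];
        }
    }

    // No known suffix found: report the whole input as the hostname.
    return ['subdomain' => null, 'hostname' => $host, 'suffix' => null];
}

$result = splitBySuffix('forums.bbc.co.uk');
echo $result['hostname'], ' / ', $result['suffix'], PHP_EOL; // prints "bbc / co.uk"
```

Because longer candidates are checked first, 'co.uk' wins over 'uk' whenever both are in the set.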

Result implements the ArrayAccess interface, so you can simply access its fields like array elements.

var_dump($result['subdomain']);
string(11) "forums.news"
var_dump($result['hostname']);
string(3) "cnn"
var_dump($result['suffix']);
string(3) "com"

You can also convert the result to JSON.

var_dump($result->toJson());
string(59) "{"subdomain":"forums.news","hostname":"cnn","suffix":"com"}"

This package is compliant with PSR-1, PSR-2, PSR-4. If you notice compliance oversights, please send a patch via pull request.

Does TLDExtract make requests to Public Suffix List website?

No. TLDExtract uses a database from TLDDatabase that is generated from the Public Suffix List and updated regularly. It does not make any HTTP requests to parse or validate a domain.

Requirements

The following versions of PHP are supported.

PHP 5.5
PHP 5.6
PHP 7.0
PHP 7.1
PHP 7.2
PHP 7.3
HHVM

Install

Via Composer

$ composer require layershifter/tld-extract

Additional result methods

The LayerShifter\TLDExtract\Result class provides some additional useful methods:

$extract = new LayerShifter\TLDExtract\Extract();

# For domain 'shop.github.com'

$result = $extract->parse('shop.github.com');
$result->getFullHost(); // will return (string) 'shop.github.com'
$result->getRegistrableDomain(); // will return (string) 'github.com'
$result->isValidDomain(); // will return (bool) true
$result->isIp(); // will return (bool) false

# For IP '192.168.0.1'

$result = $extract->parse('192.168.0.1');
$result->getFullHost(); // will return (string) '192.168.0.1'
$result->getRegistrableDomain(); // will return null
$result->isValidDomain(); // will return (bool) false
$result->isIp(); // will return (bool) true

Custom database

By default the package uses the database from the TLDDatabase package, but you can easily override this behaviour:

new LayerShifter\TLDExtract\Extract(__DIR__ . '/cache/mydatabase.php');

For more details and how to keep the database updated, see TLDDatabase.

Implement own result

By default, parsing returns an object of the LayerShifter\TLDExtract\Result class, but sometimes you need your own methods or additional functionality.

You can create your own class that implements LayerShifter\TLDExtract\ResultInterface and use it as the parse result.

class CustomResult implements LayerShifter\TLDExtract\ResultInterface {}

new LayerShifter\TLDExtract\Extract(null, CustomResult::class);

Parsing modes

The package has three parsing modes:

allow ICANN suffixes (domains delegated by ICANN or part of the IANA root zone database);
allow private domains (amendments submitted to the Public Suffix List by domain holders, as an expression of how they operate their domain security policy);
allow custom suffixes (domains that are not in the list but can still be useful, for example: example, mycompany, etc.).

To stay compatible with the ideas behind the Public Suffix List, the package runs in all these modes by default, but you can easily change this behavior:

use LayerShifter\TLDExtract\Extract;

new Extract(null, null, Extract::MODE_ALLOW_ICANN);
new Extract(null, null, Extract::MODE_ALLOW_PRIVATE);
new Extract(null, null, Extract::MODE_ALLOW_NOT_EXISTING_SUFFIXES);
new Extract(null, null, Extract::MODE_ALLOW_ICANN | Extract::MODE_ALLOW_PRIVATE);
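The last line above combines two modes with the bitwise OR operator. A minimal sketch of how such flag combinations work in general follows; the constant names and values (SKETCH_ALLOW_*) and the helper modeAllows are hypothetical, and the library's actual constant values may differ.

```php
<?php
// Hypothetical flag values for illustration; each flag occupies its own bit.
const SKETCH_ALLOW_ICANN   = 1; // binary 001
const SKETCH_ALLOW_PRIVATE = 2; // binary 010
const SKETCH_ALLOW_CUSTOM  = 4; // binary 100

// True when every bit of $flag is set in $mode.
function modeAllows(int $mode, int $flag): bool
{
    return ($mode & $flag) === $flag;
}

$mode = SKETCH_ALLOW_ICANN | SKETCH_ALLOW_PRIVATE; // binary 011

var_dump(modeAllows($mode, SKETCH_ALLOW_PRIVATE)); // bool(true)
var_dump(modeAllows($mode, SKETCH_ALLOW_CUSTOM));  // bool(false)
```

Using one bit per mode is what lets a single integer argument carry any combination of the three modes.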

Change log

Please see CHANGELOG for more information on what has changed recently.

Testing

$ composer test

Contributing

Please see CONTRIBUTING and CONDUCT for details.

License

This library is released under the Apache 2.0 license. Please see License File for more information.

tldextract's Issues

Error with .co.il domain

Hi,

My code is as follows:
$line = 'http://www.upfile.co.il/xxxxxxxx.html';
$extract = new LayerShifter\TLDExtract\Extract();
$domain = $extract->parse($line);
$domain->getRegistrableDomain();

I get www.upfile.co.il instead of upfile.co.il.

Any idea what is going wrong?

can't find the .tld_set database

I'm not able to find the .tld_set database in the "/vendor/layershifter/tld_extract/cache" folder.
Is there a way to download it?

https://youtube.comxyzh is a valid domain

Result {#217 ▼
-subdomain: null
-hostname: "youtube"
-suffix: "comxyzh"
}

local file being interpreted as a tld

Incorrect Parsing of Malicious Payload

Hello
There is a bug in the library where it mishandles the backslash character.

For example,
https://1337.karimrahal.com\.bla.com would give bla.com as the registered domain,

whereas the browser would lead to the karimrahal.com domain.

Get rid of the intl extension requirement

I use this library inside a Symfony project where intl is not required because Symfony provides a polyfill for it.

Currently it's not possible to use this library without intl installed (Composer will error). I would recommend using the Symfony intl polyfill https://github.com/symfony/polyfill-intl-icu instead of requiring ext-intl.

"symfony/polyfill-intl-icu": "^1.0",

TLD for domain test.ru is not recognized

I have tested with hundreds of domains, but for the domain test.ru I get the following result:

Result {#1711 ▼
  -subdomain: null
  -hostname: "test.ru"
  -suffix: null
}

Write tests

Write tests to complete 100% code coverage
Package must complete PSL test: http://mxr.mozilla.org/mozilla-central/source/netwerk/test/unit/data/test_psl.txt?raw=1

Errors extracting from some subdomains

Hello,

Extracting from some subdomains is not working as expected.

For example, extracting from:

'some-long-string.us-west-2.elb.amazonaws.com'
'whatever.cloudfront.net'
'whatever.googleapis.com'

I get things like:

    object(LayerShifter\TLDExtract\Result)[6]
      private 'subdomain' => null
      private 'hostname' => string 'some-long-string' (length=16)
      private 'suffix' => string 'us-west-2.elb.amazonaws.com' (length=27)

    object(LayerShifter\TLDExtract\Result)[5]
      private 'subdomain' => null
      private 'hostname' => string 'whatever' (length=8)
      private 'suffix' => string 'cloudfront.net' (length=14)

    object(LayerShifter\TLDExtract\Result)[7]
      private 'subdomain' => null
      private 'hostname' => string 'whatever' (length=8)
      private 'suffix' => string 'googleapis.com' (length=14)

bug(Parser): incorrect result of getRegistrableDomain()

This doesn't seem to work for any blogspot subdomain..

test.blogspot.com gives as registrable domain: test.blogspot.com

test.github.com gives as registrable domain: github.com

I don't understand how it would work that way?

isValidDomain returns true for domains that are too long

$url = 'http://exam-plewdgrfWEDRKGJHBSAFVHBJKSDAVJKHBDSVJKHBSDJKBVFJHBDSJHBSDJKBVASDJKHBASDJKHBFJKADBSVCJKHBASDCJHBJHASBJUHBEJDHBJASBDHABSDMNBASJKBHJWQHBDMNASBDJHEWQRJHBAJSHBMANSBDJHQWEDJHBASMNDBMASNDBMBgwqe5vqwerfvcqw4vtwergfkhabdsvjkhbqerjkhbajskdbfcjkahsgbecauhsdeclkjhslkjadhfjkasberflkjazdklcklaDSJASNDKLNAKSDJNFKLAJSDJNASKNF.com';

$extract = new LayerShifter\TLDExtract\Extract();
$result = $extract->parse($url);
$domain = $result->getRegistrableDomain();
echo $result->isValidDomain();

This returns TRUE when it should return FALSE.
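For reference, DNS limits each label to 63 octets and a full name to 253 characters in its usual dotted text form. A length check along those lines could look like the sketch below; hasValidLengths is a hypothetical helper, not a TLDExtract method, and it expects an ASCII (or punycoded) host name without scheme or path.

```php
<?php
// Hypothetical helper: checks only the DNS length limits, nothing else.
// strlen() counts bytes, so pass the ASCII/punycode form of the name.
function hasValidLengths(string $host): bool
{
    if ($host === '' || strlen($host) > 253) {
        return false;
    }

    foreach (explode('.', $host) as $label) {
        $length = strlen($label);
        // Empty labels (as in "test..com") and labels over 63 bytes are invalid.
        if ($length < 1 || $length > 63) {
            return false;
        }
    }

    return true;
}

var_dump(hasValidLengths('example.com'));                 // bool(true)
var_dump(hasValidLengths(str_repeat('a', 64) . '.com'));  // bool(false)
```

The same loop also rejects names with an empty label, since an empty string has length zero.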

Parsing .us.com .us.org .eu.org with MODE_ALLOW_ICANN

I need to extract domain names so that example.blogspot.com becomes blogspot.com and so on.

I noticed that some hostnames are probably not parsed correctly; see these examples:

profound.eu.org
website.us.org
activia.us.com

Using Extract::MODE_ALLOW_ICANN they are parsed like:

LayerShifter\TLDExtract\Result Object
(
    [subdomain:LayerShifter\TLDExtract\Result:private] => activia
    [hostname:LayerShifter\TLDExtract\Result:private] => us
    [suffix:LayerShifter\TLDExtract\Result:private] => com
)

But instead I think they should be parsed like:

LayerShifter\TLDExtract\Result Object
(
    [subdomain:LayerShifter\TLDExtract\Result:private] => 
    [hostname:LayerShifter\TLDExtract\Result:private] => activia
    [suffix:LayerShifter\TLDExtract\Result:private] => us.com
)

I think they should be handled the same as example.blogspot.com, where example is the subdomain, blogspot is the hostname and com is the suffix (this with MODE_ALLOW_ICANN, of course). Websites under *.us.org and *.us.com can be registered here:

http://us.org/
http://www.us.com/

What are your thoughts?

"@" symbol should not be allowed in domain name

$extract = new Extract();
$result = $extract->parse("test@[email protected]")->isValidDomain();

This results in true when it should be false ("@" is not allowed in domain names).
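A character check based on the traditional letters-digits-hyphen rule for host names could look like the following sketch. The helper names isLdhLabel and isLdhHostname are made up for this example, and the rule is deliberately strict: it also rejects underscore labels such as _domainkey, which do occur in DNS, so whether that is acceptable depends on the use case.

```php
<?php
// Hypothetical helpers illustrating the letters-digits-hyphen (LDH) rule.

// A label must start and end with a letter or digit; hyphens are allowed inside.
function isLdhLabel(string $label): bool
{
    return preg_match('/^[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\z/i', $label) === 1;
}

// A host name passes when every dot-separated label passes.
function isLdhHostname(string $host): bool
{
    foreach (explode('.', $host) as $label) {
        if (!isLdhLabel($label)) {
            return false;
        }
    }

    return true;
}

var_dump(isLdhHostname('shop.github.com'));  // bool(true)
var_dump(isLdhHostname('user@example.com')); // bool(false)
```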

Doesn't correctly handle URLs with a "?" just after the domain

For "http://example.com?foo=bar", getRegistrableDomain() returns "example.com?foo=bar" rather than "example.com".

My temporary quick fix in my code is to replace all "?" characters with "/" before passing the URL to TLDExtract.
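An alternative workaround is to cut the host out of the URL before handing it to the parser. The sketch below does this by hand; hostFromUrl is a hypothetical helper, and it intentionally ignores user info ("user@") and port numbers, which a complete URL parser would have to handle.

```php
<?php
// Hypothetical helper: returns the part of a URL between the scheme and the
// first "/", "?" or "#". User info and ports are NOT stripped in this sketch.
function hostFromUrl(string $url): string
{
    // Drop a leading "scheme://" if present.
    $rest = preg_replace('#^[a-z][a-z0-9+.-]*://#i', '', $url);

    // The host ends at the first path, query or fragment delimiter.
    return substr($rest, 0, strcspn($rest, '/?#'));
}

echo hostFromUrl('http://example.com?foo=bar'), PHP_EOL; // prints "example.com"
```

Stopping at "#" as well as "?" means the same helper also covers inputs with a fragment directly after the domain.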

Cache directory is not writable

Thanks for this extension.

I get this error:
Cannot put TLD list to cache file /xxxxx/vendor/layershifter/tld-extract/src/../cache/.tld_set, check writes rights on directory of file

So I would have to change the permissions of the cache directory inside vendor, which is not a good solution.

I suggest modifying the code to check whether the directory is writable with is_writable() and, if it is not, trying to change the permissions with chmod(). An even better solution would be to create the cache directory automatically with mkdir('cache', 0777).
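The suggestion could be sketched roughly as follows. ensureWritableDir is a hypothetical helper rather than existing library code, and the mode 0775 is an arbitrary choice for this example (the text above proposes 0777).

```php
<?php
// Hypothetical helper: create the directory if it is missing, then report
// whether it can be written to. It does not attempt chmod() on existing paths.
function ensureWritableDir(string $dir): bool
{
    // The second is_dir() check covers the case where another process
    // created the directory between our check and the mkdir() call.
    if (!is_dir($dir) && !@mkdir($dir, 0775, true) && !is_dir($dir)) {
        return false;
    }

    return is_writable($dir);
}

$cacheDir = sys_get_temp_dir() . '/tld_cache_sketch';
var_dump(ensureWritableDir($cacheDir));
```

A caller can then fall back to another location, or raise a clearer error, when the function returns false.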

Underscore at end of hostname label causes parsing to fail

Seems related to #25

Parsing the hostname _sub.mydomain.com results in the following:

{
  "subdomain": "_sub",
  "hostname": "mydomain",
  "suffix": "com"
}

Parsing the hostname sub_.mydomain.com results in this:

{
  "subdomain": "sub_.mydomain",
  "hostname": "com",
  "suffix": null
}

Which isn't great if you have an otherwise valid TLD and domain, and you're trying to parse out subdomains separately.

Changing the following regex should allow this:

const HOSTNAME_PATTERN = '#^((?!-)[a-z0-9_-]{0,62}[a-z0-9_]\.)+[a-z]{2,63}|[xn\-\-a-z0-9]]{6,63}$#';

Should getSubdomains return an empty array if no subdomains found?

As part of some validation requirements, I wanted to loop over subdomain labels, and since an array was the expected return of the getSubdomains() function, I assumed it would return an empty array if none were found.

Turns out it returns a null, meaning it broke any attempts to use it as the expected return type. I ended up working around this using a null coalescence like the following:

$subdomainLabels = $hostnameExtract->getSubdomains() ?? array();

So, should this function return an empty array by default? I'd personally prefer if it did, since it keeps the return type consistent - but others might have differing opinions.

test..com is valid domain

$result = $extract->parse('test..com');
$result->isValidDomain();  // gives true

The registrable domain is .com in this case... I think this domain name should be invalid.

bug(Parser): Incorrect parsing domains with number sign

Ref.

I need to investigate and fix the incorrect operation of the parser when processing these domains:

#test.com
test.com#test_test

Use without composer

How would I use this library without composer, which file do I need to include?

Thank you

RFC

Some reading to help improve or fix things where needed:

https://en.wikipedia.org/wiki/Domain_name
which introduces
https://tools.ietf.org/html/rfc1034
https://tools.ietf.org/html/rfc1035

Domain names are part of URLs:
https://en.wikipedia.org/wiki/Url
which introduces
https://url.spec.whatwg.org/

This should make it possible to close #45.

Improve IDN support

The package needs full IDN support; right now it has problems with punycoded domain suffixes.

IPv6 addresses are not recognized

It seems that IPv6 addresses are not recognized properly:

>>> $w = new LayerShifter\TLDExtract\Extract()
=> LayerShifter\TLDExtract\Extract {#197}
>>> $res = $w->parse('2bf2:eaa0::7:314:5474')
=> LayerShifter\TLDExtract\Result {#205}
>>> $res->isIp()
=> false
>>> $res->isValidDomain()
=> false
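For comparison, PHP's built-in filter_var() with FILTER_VALIDATE_IP accepts both IPv4 and IPv6 addresses. The sketch below wraps it in a hypothetical helper, isIpAddress, which also accepts the bracketed IPv6 form used in URLs; it only shows what such a check can look like and says nothing about how the library's isIp() works internally.

```php
<?php
// Hypothetical helper built on PHP's filter extension.
function isIpAddress(string $host): bool
{
    // Accept the bracketed form used in URLs, e.g. "[::1]".
    if (strlen($host) > 2 && $host[0] === '[' && substr($host, -1) === ']') {
        $host = substr($host, 1, -1);
    }

    // FILTER_VALIDATE_IP without extra flags accepts IPv4 and IPv6.
    return filter_var($host, FILTER_VALIDATE_IP) !== false;
}

var_dump(isIpAddress('2bf2:eaa0::7:314:5474')); // bool(true)
var_dump(isIpAddress('192.168.0.1'));           // bool(true)
var_dump(isIpAddress('shop.github.com'));       // bool(false)
```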

Parser bug when subdomain has "-"

The parser fails for the following:

// If the subdomain has "-"
$url = 'https://s3-ap-southeast-2.amazonaws.com/blabla/blabla/wp-content/uploads/media/2019/03/16860571424_31c94205de_b.jpg';

// Extract domain parts
$extract = new \LayerShifter\TLDExtract\Extract();
$domainParser = $extract->parse($url);

parse_url($url, PHP_URL_HOST); // s3-ap-southeast-2.amazonaws.com
$domainParser->getSubdomain(); // null

underscore in hostname

Hi,

I've found some weird behavior in domain extraction:

$extract = new Extract();
$result = $extract->parse('dkim._domainkey.phea.fr');
print_r($result->toArray());

$result = $extract->parse('dkim.domainkey.phea.fr');
print_r($result->toArray());

result

Array
(
    [subdomain] => dkim._domainkey.phea
    [hostname] => fr
    [suffix] => 
)
Array
(
    [subdomain] => dkim.domainkey
    [hostname] => phea
    [suffix] => fr
)

The problem comes from the _ character.

This regex fixes the problem:

    const HOSTNAME_PATTERN = '#^((?!-)[a-z0-9_-]{0,62}[a-z0-9]\.)+[a-z]{2,63}|[xn\-\-a-z0-9]]{6,63}$#';

TLDExtract not properly parsing hostname

I'm running some domain names through TLDExtract and came across a domain not being properly parsed.

The URL is called blogspot.com

$url = 'blogspot.com';
$domain = tld_extract($url);
var_dump($domain);

Returns: 
object(LayerShifter\TLDExtract\Result)[9]
  private 'subdomain' => null
  private 'hostname' => string 'blogspot.com' (length=12)
  private 'suffix' => null

Weirdly the URL 'flogspot.com' works fine and returns:

object(LayerShifter\TLDExtract\Result)[9]
  private 'subdomain' => null
  private 'hostname' => string 'flogspot' (length=8)
  private 'suffix' => string 'com' (length=3)

The URL logspot.com also works and returns:

object(LayerShifter\TLDExtract\Result)[9]
  private 'subdomain' => null
  private 'hostname' => string 'logspot' (length=7)
  private 'suffix' => string 'com' (length=3)

Any idea why the TLD in 'blogspot.com' is not being added to the suffix? Is this a bug?

app_dev.php is a valid domain

Hello,

would it be possible to add an exception for this special name which is used in the very popular symfony framework?

Bye

Extract registerable domain returns empty

I have a loop that extracts registerable domains. While most of them work fine, I still get a lot of empty responses. What could be the reason it does not find a registerable domain for the following examples:

mail-sor-f69.google.com (you would expect google.com)
static.vnpt.vn (you would expect vnpt.vn)
mx1.sub5.homie.mail.dreamhost.com (you would expect dreamhost.com)

and there are many, many more.

Uncaught OutOfRangeException

Hi, I'm getting the error:

Stack trace:
Message: Uncaught OutOfRangeException: Unknown field "errors" in xxx/vendor/layershifter/tld-extract/src/Result.php:220
File: xxx/vendor/layershifter/tld-extract/src/Result.php
Line: 220

thrown
#0 LayerShifter\TLDExtract\Result->__get('errors')

My code is straight from the example:

$extract = new LayerShifter\TLDExtract\Extract();
$result = $extract->parse('shop.github.com');
$result->getRegistrableDomain();

Composer install details:

Using version ^2.0 for layershifter/tld-extract
./composer.json has been updated
Loading composer repositories with package information
Updating dependencies (including require-dev)
Package operations: 5 installs, 0 updates, 0 removals
  - Installing symfony/polyfill-php72 (v1.10.0): Downloading (100%)         
  - Installing symfony/polyfill-intl-idn (v1.10.0): Downloading (100%)         
  - Installing layershifter/tld-support (1.1.1): Downloading (100%)         
  - Installing layershifter/tld-database (1.0.65): Downloading (100%)         
  - Installing layershifter/tld-extract (2.0.1): Downloading (100%)

Am I missing something or doing something wrong?

Many thanks,

jkns

Support PHP 5.4.0 for version 0.2.0

Hello,

can you please change the guzzlehttp/guzzle package in your composer version 0.2.0?

guzzlehttp/guzzle version 6.* doesn't support PHP 5.4.0. But version 5.* does.
So can you change it to "guzzlehttp/guzzle": "^5.3.1",

and update the tag version to 0.2.1 ?

Thanks for your help

Make Extract::suffixExists protected

Currently the suffixExists method on the Extract class is private. It would be useful if it were protected, so you could override the default behavior and provide an alternate data source.

Replace punycode with the symfony polyfill

Symfony now provides an official polyfill, which they maintain themselves and which is also described on the punycode GitHub page as an actively maintained alternative: https://github.com/symfony/polyfill-intl-idn
