layershifter / tldextract

[DEPRECATED] Library for extracting domain parts, e.g. the TLD. A domain parser that uses the Public Suffix List.

License: Apache License 2.0

PHP 90.98% Shell 4.03% Dockerfile 4.99%
php-library tldextract domain-parser public-suffix-list php subdomain tld

tldextract's Introduction

DEPRECATED

Consider using https://github.com/jeremykendall/php-domain-parser as a maintained alternative.

TLDExtract

TLDExtract accurately separates the gTLD or ccTLD (generic or country code top-level domain) from the registered domain and subdomains of a URL; in other words, it is a domain parser. For example, say you want just the 'google' part of 'http://www.google.com'.

Everybody gets this wrong. Splitting on the '.' and taking the last two elements only goes a long way if you're thinking of simple domains such as .com. Consider parsing http://forums.bbc.co.uk, for example: the naive splitting method above will give you 'co' as the domain and 'uk' as the TLD, instead of 'bbc' and 'co.uk' respectively.
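To make the failure concrete, here is a minimal sketch of the naive approach. The function name `naive_split` is made up for this illustration and is not part of TLDExtract.

```php
<?php
// Naive approach: treat the last label as the TLD and the one before it as the domain.
// Illustration only; naive_split() is a hypothetical helper, not a TLDExtract function.
function naive_split(string $url): array
{
    $host = (string) parse_url($url, PHP_URL_HOST); // e.g. "forums.bbc.co.uk"
    $labels = explode('.', $host);

    $tld = array_pop($labels);    // last label
    $domain = array_pop($labels); // second-to-last label

    return [
        'subdomain' => implode('.', $labels),
        'domain'    => $domain,
        'tld'       => $tld,
    ];
}

// Looks right for a simple .com host...
print_r(naive_split('http://www.google.com'));
// ...but reports 'co' as the domain for a co.uk host.
print_r(naive_split('http://forums.bbc.co.uk'));
```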

TLDExtract on the other hand knows what all gTLDs and ccTLDs look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.

$result = tld_extract('http://forums.news.cnn.com/');
var_dump($result);

object(LayerShifter\TLDExtract\Result)#34 (3) {
  ["subdomain":"LayerShifter\TLDExtract\Result":private]=>
  string(11) "forums.news"
  ["hostname":"LayerShifter\TLDExtract\Result":private]=>
  string(3) "cnn"
  ["suffix":"LayerShifter\TLDExtract\Result":private]=>
  string(3) "com"
}

Result implements the ArrayAccess interface, so you can simply access its fields like array elements.

var_dump($result['subdomain']);
string(11) "forums.news"
var_dump($result['hostname']);
string(3) "cnn"
var_dump($result['suffix']);
string(3) "com"

You can also convert the result to JSON.

var_dump($result->toJson());
string(54) "{"subdomain":"forums.news","hostname":"cnn","suffix":"com"}"

This package is compliant with PSR-1, PSR-2, PSR-4. If you notice compliance oversights, please send a patch via pull request.

Does TLDExtract make requests to Public Suffix List website?

No. TLDExtract uses a database from TLDDatabase that is generated from the Public Suffix List and updated regularly. It does not make any HTTP requests to parse or validate a domain.
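As a rough sketch of what an offline lookup can look like, the snippet below matches a host against a small in-memory suffix list, trying the longest candidate suffix first. The list `DEMO_SUFFIXES` and the function `demo_extract` are assumptions made for this example. They are not the library's implementation, and the sketch ignores the wildcard and exception rules of the real Public Suffix List.

```php
<?php
// Hypothetical, tiny stand-in for a suffix database; the real list has thousands of entries.
const DEMO_SUFFIXES = ['com', 'uk', 'co.uk'];

// Illustration only: splits a host using the longest suffix found in DEMO_SUFFIXES.
function demo_extract(string $host): array
{
    $labels = explode('.', strtolower($host));
    $count = count($labels);

    // Start at index 1 so at least one label is always left for the hostname,
    // and move right so the longest candidate suffix is tested first.
    for ($i = 1; $i < $count; $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (in_array($candidate, DEMO_SUFFIXES, true)) {
            $subdomain = implode('.', array_slice($labels, 0, $i - 1));
            return [
                'subdomain' => $subdomain === '' ? null : $subdomain,
                'hostname'  => $labels[$i - 1],
                'suffix'    => $candidate,
            ];
        }
    }

    // No known suffix: return the whole host unparsed.
    return ['subdomain' => null, 'hostname' => $host, 'suffix' => null];
}

print_r(demo_extract('forums.bbc.co.uk'));
```

No network access is involved: the lookup is a pure in-process array search.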

Requirements

The following versions of PHP are supported.

  • PHP 5.5
  • PHP 5.6
  • PHP 7.0
  • PHP 7.1
  • PHP 7.2
  • PHP 7.3
  • HHVM

Install

Via Composer

$ composer require layershifter/tld-extract

Additional result methods

The LayerShifter\TLDExtract\Result class has some useful methods:

$extract = new LayerShifter\TLDExtract\Extract();

# For domain 'shop.github.com'

$result = $extract->parse('shop.github.com');
$result->getFullHost(); // will return (string) 'shop.github.com'
$result->getRegistrableDomain(); // will return (string) 'github.com'
$result->isValidDomain(); // will return (bool) true
$result->isIp(); // will return (bool) false

# For IP '192.168.0.1'

$result = $extract->parse('192.168.0.1');
$result->getFullHost(); // will return (string) '192.168.0.1'
$result->getRegistrableDomain(); // will return null
$result->isValidDomain(); // will return (bool) false
$result->isIp(); // will return (bool) true

Custom database

By default the package uses the database from the TLDDatabase package, but you can easily override this behaviour:

new LayerShifter\TLDExtract\Extract(__DIR__ . '/cache/mydatabase.php');

For more details and how to keep the database updated, see TLDDatabase.

Implement own result

By default, parsing returns an object of the LayerShifter\TLDExtract\Result class, but sometimes you need your own methods or additional functionality.

You can create your own class that implements LayerShifter\TLDExtract\ResultInterface and use it as the parse result.

class CustomResult implements LayerShifter\TLDExtract\ResultInterface {}

new LayerShifter\TLDExtract\Extract(null, CustomResult::class);

Parsing modes

The package has three parsing modes:

  • allow ICANN suffixes (domains delegated by ICANN or part of the IANA root zone database);
  • allow private domains (amendments submitted to the Public Suffix List by the domain holder, as an expression of how they operate their domain security policy);
  • allow custom suffixes (suffixes that are not in the list but may still be in use, for example: example, mycompany, etc.).

To stay compatible with the ideas behind the Public Suffix List, the package runs in all of these modes by default, but you can easily change this behavior:

use LayerShifter\TLDExtract\Extract;

new Extract(null, null, Extract::MODE_ALLOW_ICANN);
new Extract(null, null, Extract::MODE_ALLOW_PRIVATE);
new Extract(null, null, Extract::MODE_ALLOW_NOT_EXISTING_SUFFIXES);
new Extract(null, null, Extract::MODE_ALLOW_ICANN | Extract::MODE_ALLOW_PRIVATE);
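The last line above combines two modes with the bitwise OR operator. The sketch below shows how such a bitmask can be built and tested. The `DEMO_*` constants and their values are invented for this example; the actual values of the `Extract::MODE_*` constants are not shown in this document and may differ.

```php
<?php
// Hypothetical flag values chosen for illustration; each flag occupies its own bit.
const DEMO_ALLOW_ICANN = 1;   // 0b001
const DEMO_ALLOW_PRIVATE = 2; // 0b010
const DEMO_ALLOW_CUSTOM = 4;  // 0b100

// Returns true when every bit of $flag is set in $mode.
function demo_mode_allows(int $mode, int $flag): bool
{
    return ($mode & $flag) === $flag;
}

// Combine two modes, as in the ICANN | PRIVATE example above.
$mode = DEMO_ALLOW_ICANN | DEMO_ALLOW_PRIVATE;

var_dump(demo_mode_allows($mode, DEMO_ALLOW_ICANN));
var_dump(demo_mode_allows($mode, DEMO_ALLOW_CUSTOM));
```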

Change log

Please see CHANGELOG for more information on what has changed recently.

Testing

$ composer test

Contributing

Please see CONTRIBUTING and CONDUCT for details.

License

This library is released under the Apache 2.0 license. Please see License File for more information.

tldextract's People

Contributors

alexander-schranz, erwane, gman98ish, laszlof, layershifter, rohaq, scrutinizer-auto-fixer

tldextract's Issues

can't find the .tld_set database

I'm not able to find the .tld_set database in the "/vendor/layershifter/tld_extract/cache" folder.
Is there a way to download it?

test..com is valid domain

$result = $extract->parse('test..com');
$result->isValidDomain();  // gives true

The registrable domain is .com in this case... I think this domain name should be invalid.

Should getSubdomains return an empty array if no subdomains found?

As part of some validation requirements, I wanted to loop over subdomain labels, and since an array was the expected return of the getSubdomains() function, I assumed it would return an empty array if none were found.

Turns out it returns null, which broke any attempt to use it as the expected return type. I ended up working around this with the null coalescing operator, like the following:

$subdomainLabels = $hostnameExtract->getSubdomains() ?? array();

So, should this function return an empty array by default? I'd personally prefer if it did, since it keeps the return type consistent - but others might have differing opinions.
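Until the return type is made consistent, the workaround can be wrapped in a small helper so that calling code can always loop safely. This is a minimal sketch; `labels_or_empty` is a hypothetical name and not part of the library, and the nullable type hint assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: normalises a "null or array" return value to an array.
function labels_or_empty(?array $labels): array
{
    return $labels ?? [];
}

// Simulates the case described above, where no subdomains were found.
$subdomainLabels = labels_or_empty(null);

// The loop body simply never runs when there are no labels.
foreach ($subdomainLabels as $label) {
    echo $label, PHP_EOL;
}
```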

Uncaught OutOfRangeException

Hi, I'm getting the error:

Stack trace:
Message: Uncaught OutOfRangeException: Unknown field "errors" in xxx/vendor/layershifter/tld-extract/src/Result.php:220
File: xxx/vendor/layershifter/tld-extract/src/Result.php
Line: 220

thrown
#0 LayerShifter\TLDExtract\Result->__get('errors')

My code is straight from the example:

$extract = new LayerShifter\TLDExtract\Extract();
$result = $extract->parse('shop.github.com');
$result->getRegistrableDomain();

Composer install details:

Using version ^2.0 for layershifter/tld-extract
./composer.json has been updated
Loading composer repositories with package information
Updating dependencies (including require-dev)
Package operations: 5 installs, 0 updates, 0 removals
  - Installing symfony/polyfill-php72 (v1.10.0): Downloading (100%)         
  - Installing symfony/polyfill-intl-idn (v1.10.0): Downloading (100%)         
  - Installing layershifter/tld-support (1.1.1): Downloading (100%)         
  - Installing layershifter/tld-database (1.0.65): Downloading (100%)         
  - Installing layershifter/tld-extract (2.0.1): Downloading (100%)   

Am I missing something or doing something wrong?

Many thanks,

jkns

Parsing .us.com .us.org .eu.org with MODE_ALLOW_ICANN

I need to extract domain names so that example.blogspot.com becomes blogspot.com and so on.

I noticed that some hostnames are probably not parsed correctly; see these examples:

profound.eu.org
website.us.org
activia.us.com

Using Extract::MODE_ALLOW_ICANN they are parsed like:

LayerShifter\TLDExtract\Result Object
(
    [subdomain:LayerShifter\TLDExtract\Result:private] => activia
    [hostname:LayerShifter\TLDExtract\Result:private] => us
    [suffix:LayerShifter\TLDExtract\Result:private] => com
)

But instead I think they should be parsed like:

LayerShifter\TLDExtract\Result Object
(
    [subdomain:LayerShifter\TLDExtract\Result:private] => 
    [hostname:LayerShifter\TLDExtract\Result:private] => activia
    [suffix:LayerShifter\TLDExtract\Result:private] => us.com
)

I think they should be handled the same as example.blogspot.com, where example is the subdomain, blogspot is the hostname and com is the suffix (this with MODE_ALLOW_ICANN of course). Websites with *.us.org and *.us.com can be registered here:

http://us.org/
http://www.us.com/

What are your thoughts?

TLD for domain test.ru is not recognized

I have tested with hundreds of domains, but for the domain test.ru I get the following result:

Result {#1711 ▼
  -subdomain: null
  -hostname: "test.ru"
  -suffix: null
}

app_dev.php is a valid domain

Hello,

Would it be possible to add an exception for this special name, which is used in the very popular Symfony framework?

Bye

Parser bug when subdomain has "-"

The parser fails for the following:

// If the subdomain has "-"
$url = 'https://s3-ap-southeast-2.amazonaws.com/blabla/blabla/wp-content/uploads/media/2019/03/16860571424_31c94205de_b.jpg';

// Extract domain parts
$extract = new \LayerShifter\TLDExtract\Extract();
$domainParser = $extract->parse($url);

parse_url($url, PHP_URL_HOST); // s3-ap-southeast-2.amazonaws.com
$domainParser->getSubdomain(); // null 

Improve IDN support

The package needs full IDN support; currently it has problems with punycoded domain suffixes.

IPv6 addresses are not recognized

It seems that IPv6 addresses are not recognized properly:

>>> $w = new LayerShifter\TLDExtract\Extract()
=> LayerShifter\TLDExtract\Extract {#197}
>>> $res = $w->parse('2bf2:eaa0::7:314:5474')
=> LayerShifter\TLDExtract\Result {#205}
>>> $res->isIp()
=> false
>>> $res->isValidDomain()
=> false
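For comparison, PHP's built-in `filter_var()` with `FILTER_VALIDATE_IP` accepts both IPv4 and IPv6 literals. The sketch below is one possible way to detect IP addresses and is not necessarily how the library does it; `demo_is_ip` is a hypothetical name.

```php
<?php
// Hypothetical helper: returns true for IPv4 and IPv6 literals.
function demo_is_ip(string $host): bool
{
    // IPv6 literals inside URLs are wrapped in square brackets, e.g. "[::1]".
    $host = trim($host, '[]');

    // filter_var() returns the filtered string on success and false on failure.
    return filter_var($host, FILTER_VALIDATE_IP) !== false;
}

var_dump(demo_is_ip('2bf2:eaa0::7:314:5474'));
var_dump(demo_is_ip('192.168.0.1'));
var_dump(demo_is_ip('shop.github.com'));
```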

Underscore at end of hostname label causes parsing to fail

Seems related to #25

Parsing the hostname _sub.mydomain.com results in the following:

{
  "subdomain": "_sub",
  "hostname": "mydomain",
  "suffix": "com"
}

Parsing the hostname sub_.mydomain.com results in this:

{
  "subdomain": "sub_.mydomain",
  "hostname": "com",
  "suffix": null
}

Which isn't great if you have an otherwise valid TLD and domain, and you're trying to parse out subdomains separately.

Changing the following regex should allow this:

const HOSTNAME_PATTERN = '#^((?!-)[a-z0-9_-]{0,62}[a-z0-9_]\.)+[a-z]{2,63}|[xn\-\-a-z0-9]]{6,63}$#';

TLDExtract not properly parsing hostname

I'm running some domain names through TLDExtract and came across a domain not being properly parsed.

The URL is called blogspot.com

$url = 'blogspot.com';
$domain = tld_extract($url);
var_dump($domain);

Returns: 
object(LayerShifter\TLDExtract\Result)[9]
  private 'subdomain' => null
  private 'hostname' => string 'blogspot.com' (length=12)
  private 'suffix' => null

Weirdly the URL 'flogspot.com' works fine and returns:

object(LayerShifter\TLDExtract\Result)[9]
  private 'subdomain' => null
  private 'hostname' => string 'flogspot' (length=8)
  private 'suffix' => string 'com' (length=3)

The URL logspot.com also works and returns:

object(LayerShifter\TLDExtract\Result)[9]
  private 'subdomain' => null
  private 'hostname' => string 'logspot' (length=7)
  private 'suffix' => string 'com' (length=3)

Any idea why the TLD in 'blogspot.com' is not being added to the suffix? Is this a bug?

underscore in hostname

Hi,

I've found some weird behavior in domain extraction:

$extract = new Extract();
$result = $extract->parse('dkim._domainkey.phea.fr');
print_r($result->toArray());

$result = $extract->parse('dkim.domainkey.phea.fr');
print_r($result->toArray());

result

Array
(
    [subdomain] => dkim._domainkey.phea
    [hostname] => fr
    [suffix] => 
)
Array
(
    [subdomain] => dkim.domainkey
    [hostname] => phea
    [suffix] => fr
)

The problem comes from the _ character.

This regex fixes the problem:

    const HOSTNAME_PATTERN = '#^((?!-)[a-z0-9_-]{0,62}[a-z0-9]\.)+[a-z]{2,63}|[xn\-\-a-z0-9]]{6,63}$#';
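The two patterns proposed in these underscore issues differ in one place: whether an underscore may also be the last character of a label. The sketch below copies both patterns verbatim from the issue text and runs them against the hostnames mentioned. The constant names `PATTERN_INNER_UNDERSCORE` and `PATTERN_TRAILING_UNDERSCORE` are made up for this comparison.

```php
<?php
// Pattern from this issue: underscore allowed inside a label, but not as its last character.
const PATTERN_INNER_UNDERSCORE = '#^((?!-)[a-z0-9_-]{0,62}[a-z0-9]\.)+[a-z]{2,63}|[xn\-\-a-z0-9]]{6,63}$#';

// Pattern from the "underscore at end of hostname label" issue: trailing underscore allowed too.
const PATTERN_TRAILING_UNDERSCORE = '#^((?!-)[a-z0-9_-]{0,62}[a-z0-9_]\.)+[a-z]{2,63}|[xn\-\-a-z0-9]]{6,63}$#';

$hosts = ['dkim._domainkey.phea.fr', '_sub.mydomain.com', 'sub_.mydomain.com'];

foreach ($hosts as $host) {
    // preg_match() returns 1 on a match and 0 otherwise.
    printf(
        "%-26s inner=%d trailing=%d\n",
        $host,
        preg_match(PATTERN_INNER_UNDERSCORE, $host),
        preg_match(PATTERN_TRAILING_UNDERSCORE, $host)
    );
}
```

Only the second pattern accepts `sub_.mydomain.com`, so which one is appropriate depends on whether labels ending in an underscore should be treated as parseable.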

Use without composer

How would I use this library without composer, which file do I need to include?

Thank you

Errors extracting from some subdomains

Hello,

Extracting from some subdomains is not working as expected.

For example, extracting from:

  • 'some-long-string.us-west-2.elb.amazonaws.com'
  • 'whatever.cloudfront.net'
  • 'whatever.googleapis.com'

I get things like:

    object(LayerShifter\TLDExtract\Result)[6]
      private 'subdomain' => null
      private 'hostname' => string 'some-long-string' (length=16)
      private 'suffix' => string 'us-west-2.elb.amazonaws.com' (length=27)

    object(LayerShifter\TLDExtract\Result)[5]
      private 'subdomain' => null
      private 'hostname' => string 'whatever' (length=8)
      private 'suffix' => string 'cloudfront.net' (length=14)

    object(LayerShifter\TLDExtract\Result)[7]
      private 'subdomain' => null
      private 'hostname' => string 'whatever' (length=8)
      private 'suffix' => string 'googleapis.com' (length=14)

Cache directory is not writable

Thanks for this extension.

I get this error:
Cannot put TLD list to cache file /xxxxx/vendor/layershifter/tld-extract/src/../cache/.tld_set, check writes rights on directory of file

So I would have to change the permissions of the cache directory inside vendor, which is not a good solution.

I suggest modifying the code to check whether the directory is writable with is_writable() and, if it is not, to try changing the permissions with chmod(). An even better solution would be to create the cache directory automatically with mkdir('cache', 0777);.
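A sketch of the check suggested here is shown below. It creates the directory when it is missing and reports whether it can be written to, instead of changing permissions inside vendor. The function name `demo_ensure_writable_dir` and the temp-directory path are assumptions for this example, and the mode 0775 is used in place of the 0777 from the suggestion as a more conservative default.

```php
<?php
// Hypothetical helper: makes sure $dir exists and is writable.
function demo_ensure_writable_dir(string $dir): bool
{
    // Create the directory (and any missing parents) if it does not exist yet.
    // The second is_dir() check covers the case where another process created it meanwhile.
    if (!is_dir($dir) && !@mkdir($dir, 0775, true) && !is_dir($dir)) {
        return false;
    }

    return is_writable($dir);
}

// Example: use a location outside vendor, here under the system temp directory.
$cacheDir = sys_get_temp_dir() . '/tld-demo-cache';
var_dump(demo_ensure_writable_dir($cacheDir));
```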

Support PHP 5.4.0 for version 0.2.0

Hello,

can you please change the guzzlehttp/guzzle package in your composer version 0.2.0?

guzzlehttp/guzzle version 6.* doesn't support PHP 5.4.0. But version 5.* does.
So can you change it to "guzzlehttp/guzzle": "^5.3.1",

and update the tag version to 0.2.1 ?

Thanks for your help

Make Extract::suffixExists protected

Currently the suffixExists method on the Extract class is private. It would be useful if it were protected, so you could override the default behavior and provide an alternate data source.

isValidDomain returns true for domains that are too long

$url = 'http://exam-plewdgrfWEDRKGJHBSAFVHBJKSDAVJKHBDSVJKHBSDJKBVFJHBDSJHBSDJKBVASDJKHBASDJKHBFJKADBSVCJKHBASDCJHBJHASBJUHBEJDHBJASBDHABSDMNBASJKBHJWQHBDMNASBDJHEWQRJHBAJSHBMANSBDJHQWEDJHBASMNDBMASNDBMBgwqe5vqwerfvcqw4vtwergfkhabdsvjkhbqerjkhbajskdbfcjkahsgbecauhsdeclkjhslkjadhfjkasberflkjazdklcklaDSJASNDKLNAKSDJNFKLAJSDJNASKNF.com';

$extract = new LayerShifter\TLDExtract\Extract();
$result = $extract->parse($url);
$domain = $result->getRegistrableDomain();
echo $result->isValidDomain();

This returns TRUE when it should return FALSE.
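DNS limits each label to 63 octets and a full name to 253 characters in its textual form, so a length check can be done before any suffix lookup. The sketch below is one possible pre-check under those limits; `demo_has_valid_lengths` is a hypothetical name, and the function is not part of the library. It also rejects empty labels, as in the `test..com` report.

```php
<?php
// Hypothetical pre-check: enforces DNS length limits on a bare hostname (no scheme or path).
function demo_has_valid_lengths(string $host): bool
{
    // The whole name may be at most 253 characters in textual form.
    if ($host === '' || strlen($host) > 253) {
        return false;
    }

    foreach (explode('.', $host) as $label) {
        $length = strlen($label);
        // Each label must contain between 1 and 63 octets.
        if ($length < 1 || $length > 63) {
            return false;
        }
    }

    return true;
}

var_dump(demo_has_valid_lengths('shop.github.com'));
var_dump(demo_has_valid_lengths(str_repeat('a', 64) . '.com'));
var_dump(demo_has_valid_lengths('test..com'));
```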

Error with .co.il domain

Hi,

My code is as follows:
$line = 'http://www.upfile.co.il/xxxxxxxx.html';
$extract = new LayerShifter\TLDExtract\Extract();
$domain = $extract->parse($line);
$domain->getRegistrableDomain();

I get www.upfile.co.il instead of upfile.co.il.

Any idea what is going wrong?

bug(Parser): incorrect result of getRegistrableDomain()

This doesn't seem to work for any blogspot subdomain..

test.blogspot.com gives as registrable domain: test.blogspot.com

test.github.com gives as registrable domain: github.com

I don't understand how it would work that way?

Extracting the registrable domain returns empty

I have a loop that extracts registrable domains. Most of them run fine, but I still get a lot of empty responses. What could be the reason that it does not find a registrable domain for the following examples:

mail-sor-f69.google.com (you would expect google.com)
static.vnpt.vn (you would expect vnpt.vn)
mx1.sub5.homie.mail.dreamhost.com (you would expect dreamhost.com)

And there are many, many more.
