layershifter / tldextract

[DEPRECATED] Library for extraction of domain parts e.g. TLD. Domain parser that uses Public Suffix List

License: Apache License 2.0

PHP 90.98% Shell 4.03% Dockerfile 4.99%
php-library tldextract domain-parser public-suffix-list php subdomain tld

tldextract's Issues

Improve IDN support

The package needs full IDN support; it currently has problems with punycoded domain suffixes.

TLDExtract not properly parsing hostname

I'm running some domain names through TLDExtract and came across a domain not being properly parsed.

The URL is blogspot.com:

$url = 'blogspot.com';
$domain = tld_extract($url);
var_dump($domain);

Returns: 
object(LayerShifter\TLDExtract\Result)[9]
  private 'subdomain' => null
  private 'hostname' => string 'blogspot.com' (length=12)
  private 'suffix' => null

Weirdly the URL 'flogspot.com' works fine and returns:

object(LayerShifter\TLDExtract\Result)[9]
  private 'subdomain' => null
  private 'hostname' => string 'flogspot' (length=8)
  private 'suffix' => string 'com' (length=3)

The URL logspot.com also works and returns:

object(LayerShifter\TLDExtract\Result)[9]
  private 'subdomain' => null
  private 'hostname' => string 'logspot' (length=7)
  private 'suffix' => string 'com' (length=3)

Any idea why the TLD in 'blogspot.com' is not being added to the suffix? Is this a bug?

Parser bug when subdomain has "-"

The parser fails for the following:

// If the subdomain has "-"
$url = 'https://s3-ap-southeast-2.amazonaws.com/blabla/blabla/wp-content/uploads/media/2019/03/16860571424_31c94205de_b.jpg';

// Extract domain parts
$extract = new \LayerShifter\TLDExtract\Extract();
$domainParser = $extract->parse($url);

parse_url($url, PHP_URL_HOST); // s3-ap-southeast-2.amazonaws.com
$domainParser->getSubdomain(); // null 

app_dev.php is a valid domain

Hello,

Would it be possible to add an exception for this special name, which is used in the very popular Symfony framework?

Bye

Errors extracting from some subdomains

Hello,

Extracting from some subdomains is not working as expected.

For example, extracting from:

  • 'some-long-string.us-west-2.elb.amazonaws.com'
  • 'whatever.cloudfront.net'
  • 'whatever.googleapis.com'

I get things like:

    object(LayerShifter\TLDExtract\Result)[6]
      private 'subdomain' => null
      private 'hostname' => string 'some-long-string' (length=16)
      private 'suffix' => string 'us-west-2.elb.amazonaws.com' (length=27)

    object(LayerShifter\TLDExtract\Result)[5]
      private 'subdomain' => null
      private 'hostname' => string 'whatever' (length=8)
      private 'suffix' => string 'cloudfront.net' (length=14)

    object(LayerShifter\TLDExtract\Result)[7]
      private 'subdomain' => null
      private 'hostname' => string 'whatever' (length=8)
      private 'suffix' => string 'googleapis.com' (length=14)

Uncaught OutOfRangeException

Hi, I'm getting the error:

Stack trace:
Message: Uncaught OutOfRangeException: Unknown field "errors" in xxx/vendor/layershifter/tld-extract/src/Result.php:220
File: xxx/vendor/layershifter/tld-extract/src/Result.php
Line: 220

thrown
#0 LayerShifter\TLDExtract\Result->__get('errors')

My code is straight from the example:

$extract = new LayerShifter\TLDExtract\Extract();
$result = $extract->parse('shop.github.com');
$result->getRegistrableDomain();

Composer install details:

Using version ^2.0 for layershifter/tld-extract
./composer.json has been updated
Loading composer repositories with package information
Updating dependencies (including require-dev)
Package operations: 5 installs, 0 updates, 0 removals
  - Installing symfony/polyfill-php72 (v1.10.0): Downloading (100%)         
  - Installing symfony/polyfill-intl-idn (v1.10.0): Downloading (100%)         
  - Installing layershifter/tld-support (1.1.1): Downloading (100%)         
  - Installing layershifter/tld-database (1.0.65): Downloading (100%)         
  - Installing layershifter/tld-extract (2.0.1): Downloading (100%)   

Am I missing something or doing something wrong?

Many thanks,

jkns

Support PHP 5.4.0 for version 0.2.0

Hello,

Can you please change the guzzlehttp/guzzle requirement in your composer.json for version 0.2.0?

guzzlehttp/guzzle version 6.* doesn't support PHP 5.4.0, but version 5.* does.
So can you change it to "guzzlehttp/guzzle": "^5.3.1",

and update the tag version to 0.2.1?

Thanks for your help

underscore in hostname

Hi,

I've found a weird behavior in domain extraction:

$extract = new Extract();
$result = $extract->parse('dkim._domainkey.phea.fr');
print_r($result->toArray());

$result = $extract->parse('dkim.domainkey.phea.fr');
print_r($result->toArray());

Result:

Array
(
    [subdomain] => dkim._domainkey.phea
    [hostname] => fr
    [suffix] => 
)
Array
(
    [subdomain] => dkim.domainkey
    [hostname] => phea
    [suffix] => fr
)

The problem comes from the _ character.

This regex fixes the problem:

    const HOSTNAME_PATTERN = '#^((?!-)[a-z0-9_-]{0,62}[a-z0-9]\.)+[a-z]{2,63}|[xn\-\-a-z0-9]]{6,63}$#';

Make Extract::suffixExists protected

Currently the suffixExists method on the Extract class is private. It would be useful if it were protected, so that you could override the default behavior and provide an alternate data source.

Should getSubdomains return an empty array if no subdomains found?

As part of some validation requirements, I wanted to loop over subdomain labels, and since an array was the expected return of the getSubdomains() function, I assumed it would return an empty array if none were found.

It turns out that it returns null, which broke any attempt to use it as the expected return type. I ended up working around this with the null coalescing operator, like the following:

$subdomainLabels = $hostnameExtract->getSubdomains() ?? array();

So, should this function return an empty array by default? I'd personally prefer if it did, since it keeps the return type consistent - but others might have differing opinions.
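Until the return type is settled, a caller can normalize the value at the boundary. This is a minimal sketch that assumes only what is described above, namely that getSubdomains() may hand back either null or an array of labels. The helper name subdomain_labels() is hypothetical and not part of the library.

```php
<?php
// Hypothetical helper (not part of the library): turn a nullable list of
// subdomain labels into a plain array so it can always be iterated.
function subdomain_labels(?array $labels): array
{
    return $labels ?? [];
}

// Both calls are safe to loop over.
foreach (subdomain_labels(null) as $label) {
    echo $label, PHP_EOL; // never reached: the list is empty
}
foreach (subdomain_labels(['shop', 'eu']) as $label) {
    echo $label, PHP_EOL; // prints "shop", then "eu"
}
```

In real code the argument would be the value returned by getSubdomains(); the literal arrays here only stand in for it.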

test..com is valid domain

$result = $extract->parse('test..com');
$result->isValidDomain();  // gives true

The registrable domain is .com in this case... I think this domain name should be invalid.

isValidDomain returns true for domains that are too long

$url = 'http://exam-plewdgrfWEDRKGJHBSAFVHBJKSDAVJKHBDSVJKHBSDJKBVFJHBDSJHBSDJKBVASDJKHBASDJKHBFJKADBSVCJKHBASDCJHBJHASBJUHBEJDHBJASBDHABSDMNBASJKBHJWQHBDMNASBDJHEWQRJHBAJSHBMANSBDJHQWEDJHBASMNDBMASNDBMBgwqe5vqwerfvcqw4vtwergfkhabdsvjkhbqerjkhbajskdbfcjkahsgbecauhsdeclkjhslkjadhfjkasberflkjazdklcklaDSJASNDKLNAKSDJNFKLAJSDJNASKNF.com';

$extract = new LayerShifter\TLDExtract\Extract();
$result = $extract->parse($url);
$domain = $result->getRegistrableDomain();
echo $result->isValidDomain();

will return TRUE when it should return FALSE
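A caller who needs this guarantee today could check the DNS length limits before consulting the parser. The sketch below is an assumption-laden illustration, not the library's validation: has_valid_dns_lengths() is a hypothetical name, the input is assumed to be a bare ASCII host name (no scheme, path, or trailing dot), and the limits used are the usual DNS ones of 63 characters per label and 253 characters for the whole name.

```php
<?php
// Hypothetical pre-check (not part of the library): enforce the DNS length
// limits on a bare ASCII host name.
function has_valid_dns_lengths(string $host): bool
{
    // The textual form of a name is limited to 253 characters.
    if ($host === '' || strlen($host) > 253) {
        return false;
    }
    // Every label must contain between 1 and 63 characters.
    foreach (explode('.', $host) as $label) {
        $length = strlen($label);
        if ($length < 1 || $length > 63) {
            return false;
        }
    }
    return true;
}

var_dump(has_valid_dns_lengths('example.com'));                // bool(true)
var_dump(has_valid_dns_lengths(str_repeat('a', 64) . '.com')); // bool(false)
var_dump(has_valid_dns_lengths('test..com'));                  // bool(false)
```

Because an empty label fails the per-label check, the same function also rejects a name with two consecutive dots.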

Extract registrable domain returns empty

I have a loop that extracts registrable domains. Most of them work fine, but I still get a lot of empty responses. What could be the reason that no registrable domain is found for the following examples:

mail-sor-f69.google.com (you would expect google.com)
static.vnpt.vn (you would expect vnpt.vn)
mx1.sub5.homie.mail.dreamhost.com (you would expect dreamhost.com)

And there are many, many more.

IPv6 addresses are not recognized

It seems that IPv6 addresses are not recognized properly:

>>> $w = new LayerShifter\TLDExtract\Extract()
=> LayerShifter\TLDExtract\Extract {#197}
>>> $res = $w->parse('2bf2:eaa0::7:314:5474')
=> LayerShifter\TLDExtract\Result {#205}
>>> $res->isIp()
=> false
>>> $res->isValidDomain()
=> false
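As a stopgap, PHP's built-in filter_var() can classify the input before it reaches the parser. This is a minimal sketch: the helper name ip_version() is hypothetical, and it says nothing about how the library's own isIp() is implemented.

```php
<?php
// Hypothetical helper (not part of the library): report whether a string is
// an IPv4 address, an IPv6 address, or neither, using PHP's filter extension.
function ip_version(string $value): ?int
{
    if (filter_var($value, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false) {
        return 4;
    }
    if (filter_var($value, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false) {
        return 6;
    }
    return null;
}

var_dump(ip_version('2bf2:eaa0::7:314:5474')); // int(6)
var_dump(ip_version('192.0.2.1'));             // int(4)
var_dump(ip_version('shop.github.com'));       // NULL
```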

Use without composer

How would I use this library without Composer? Which file do I need to include?

Thank you

TLD for domain test.ru is not recognized

I have tested with hundreds of domains, but for the domain test.ru I get the following result:

Result {#1711 ▼
  -subdomain: null
  -hostname: "test.ru"
  -suffix: null
}

can't find the .tld_set database

I'm not able to find the .tld_set database in the "/vendor/layershifter/tld_extract/cache" folder.
Is there a way to download it?

Cache directory is not writable

Thanks for this extension.

I get this error:
Cannot put TLD list to cache file /xxxxx/vendor/layershifter/tld-extract/src/../cache/.tld_set, check writes rights on directory of file

So I would have to change the permissions of the cache directory inside vendor, which is not a good solution.

I suggest modifying the code to check whether the directory is writable with is_writable() and, if it is not, to try changing the permissions with chmod(). A better solution would be to create the cache directory automatically with mkdir('cache', 0777);.
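The suggested check could look roughly like the sketch below. It only illustrates the idea: the function name writable_cache_dir() and the fallback to the system temporary directory are assumptions rather than library behaviour, and it uses mode 0775 instead of the 0777 proposed above so that the directory is not world-writable.

```php
<?php
// Hypothetical helper (not part of the library): return a directory that can
// hold the cache file, creating it if needed and falling back to the system
// temporary directory when the preferred location is not usable.
function writable_cache_dir(string $preferred): string
{
    // Create the directory (and any missing parents) if it does not exist.
    if (!is_dir($preferred)) {
        @mkdir($preferred, 0775, true);
    }
    if (is_dir($preferred) && is_writable($preferred)) {
        return $preferred;
    }
    // Assumed fallback: a location that is normally writable.
    return sys_get_temp_dir();
}

// The path below is an example location, not the library's cache path.
$dir = writable_cache_dir(sys_get_temp_dir() . '/tld-extract-cache');
echo $dir, PHP_EOL;
```

The effective permissions of a directory created by mkdir() are also subject to the process umask, so the mode passed here is an upper bound.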

Underscore at end of hostname label causes parsing to fail

Seems related to #25

Parsing the hostname _sub.mydomain.com results in the following:

{
  "subdomain": "_sub",
  "hostname": "mydomain",
  "suffix": "com"
}

Parsing the hostname sub_.mydomain.com results in this:

{
  "subdomain": "sub_.mydomain",
  "hostname": "com",
  "suffix": null
}

Which isn't great if you have an otherwise valid TLD and domain, and you're trying to parse out subdomains separately.

Changing the following regex should allow this:

const HOSTNAME_PATTERN = '#^((?!-)[a-z0-9_-]{0,62}[a-z0-9_]\.)+[a-z]{2,63}|[xn\-\-a-z0-9]]{6,63}$#';

bug(Parser): incorrect result of getRegistrableDomain()

This doesn't seem to work for any blogspot subdomain..

test.blogspot.com gives as registrable domain: test.blogspot.com

test.github.com gives as registrable domain: github.com

I don't understand how it would work that way?
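One way such a result can arise is longest-suffix matching against a suffix list. The sketch below is an illustration only. The two-entry list is made up for the example: it assumes that blogspot.com is listed as a suffix while github.com is not. The function registrable_domain() is hypothetical and is not the library's algorithm; it also ignores the wildcard and exception rules of the real Public Suffix List.

```php
<?php
// Hypothetical illustration (not the library's algorithm): find the longest
// listed suffix of a host and return that suffix plus one more label.
function registrable_domain(string $host, array $suffixes): ?string
{
    $labels = explode('.', strtolower($host));
    $count  = count($labels);

    // Candidates run from the whole host down to the last label, so the
    // first match is the longest listed suffix.
    for ($i = 0; $i < $count; $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (in_array($candidate, $suffixes, true)) {
            // A host that is itself a suffix has no registrable domain.
            return $i === 0 ? null : $labels[$i - 1] . '.' . $candidate;
        }
    }
    return null; // no listed suffix matched
}

// Toy list for the example only; real data comes from the Public Suffix List.
$suffixes = ['com', 'blogspot.com'];

var_dump(registrable_domain('test.blogspot.com', $suffixes)); // string(17) "test.blogspot.com"
var_dump(registrable_domain('test.github.com', $suffixes));   // string(10) "github.com"
var_dump(registrable_domain('blogspot.com', $suffixes));      // NULL
```

Under that assumption, test.blogspot.com is one label plus the suffix blogspot.com, so it is its own registrable domain, whereas test.github.com reduces to github.com under the suffix com.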

Parsing .us.com .us.org .eu.org with MODE_ALLOW_ICANN

I need to extract domain names so that example.blogspot.com becomes blogspot.com and so on.

I noticed that some hostnames are probably not parsed correctly. See these examples:

profound.eu.org
website.us.org
activia.us.com

Using Extract::MODE_ALLOW_ICANN they are parsed like:

LayerShifter\TLDExtract\Result Object
(
    [subdomain:LayerShifter\TLDExtract\Result:private] => activia
    [hostname:LayerShifter\TLDExtract\Result:private] => us
    [suffix:LayerShifter\TLDExtract\Result:private] => com
)

But instead I think they should be parsed like:

LayerShifter\TLDExtract\Result Object
(
    [subdomain:LayerShifter\TLDExtract\Result:private] => 
    [hostname:LayerShifter\TLDExtract\Result:private] => activia
    [suffix:LayerShifter\TLDExtract\Result:private] => us.com
)

I think they should be handled the same as example.blogspot.com, where example is the subdomain, blogspot is the hostname and com is the suffix (this is with MODE_ALLOW_ICANN, of course). Websites with *.us.org and *.us.com can be registered here:

http://us.org/
http://www.us.com/

What are your thoughts?

Error with .co.il domain

Hi,

My code is as follows:
$line = 'http://www.upfile.co.il/xxxxxxxx.html';
$extract = new LayerShifter\TLDExtract\Extract();
$domain = $extract->parse($line);
$domain->getRegistrableDomain();

I get www.upfile.co.il instead of upfile.co.il.

Any idea what is going wrong?
