weppos / publicsuffix-ruby

Domain name parser for Ruby based on the Public Suffix List.

Home Page: https://simonecarletti.com/code/publicsuffix

License: MIT License

Ruby 100.00%

ruby publicsuffix psl

publicsuffix-ruby's Introduction

Public Suffix for Ruby

PublicSuffix is a Ruby domain name parser based on the Public Suffix List.

Requirements

PublicSuffix requires Ruby >= 3.0. For older versions of Ruby, use a previous release.

Installation

You can install the gem manually:

gem install public_suffix

Or use Bundler and define it as a dependency in your Gemfile:

gem 'public_suffix'

Usage

Extract the domain from a name:

PublicSuffix.domain("google.com")
# => "google.com"
PublicSuffix.domain("www.google.com")
# => "google.com"
PublicSuffix.domain("www.google.co.uk")
# => "google.co.uk"

Parse a domain without subdomains:

domain = PublicSuffix.parse("google.com")
# => #<PublicSuffix::Domain>
domain.tld
# => "com"
domain.sld
# => "google"
domain.trd
# => nil
domain.domain
# => "google.com"
domain.subdomain
# => nil

Parse a domain with subdomains:

domain = PublicSuffix.parse("www.google.com")
# => #<PublicSuffix::Domain>
domain.tld
# => "com"
domain.sld
# => "google"
domain.trd
# => "www"
domain.domain
# => "google.com"
domain.subdomain
# => "www.google.com"

Simple validation example:

PublicSuffix.valid?("google.com")
# => true

PublicSuffix.valid?("www.google.com")
# => true

# Explicitly forbidden, it is listed as a private domain
PublicSuffix.valid?("blogspot.com")
# => false

# Unknown/not-listed TLD domains are valid by default
PublicSuffix.valid?("example.tldnotlisted")
# => true

Strict validation (without applying the default * rule):

PublicSuffix.valid?("example.tldnotlisted", default_rule: nil)
# => false

Fully Qualified Domain Names

This library automatically recognizes Fully Qualified Domain Names. An FQDN is a domain name that ends with a trailing dot.

# Parse a standard domain name
PublicSuffix.domain("www.google.com")
# => "google.com"

# Parse a fully qualified domain name
PublicSuffix.domain("www.google.com.")
# => "google.com"

Private domains

This library supports switching off private (non-ICANN) domains.

# Extract a domain including private domains (by default)
PublicSuffix.domain("something.blogspot.com")
# => "something.blogspot.com"

# Extract a domain excluding private domains
PublicSuffix.domain("something.blogspot.com", ignore_private: true)
# => "blogspot.com"

# It also works for #parse and #valid?
PublicSuffix.parse("something.blogspot.com", ignore_private: true)
PublicSuffix.valid?("something.blogspot.com", ignore_private: true)

If you don't care about private domains at all, it's more efficient to exclude them when the list is parsed:

# Disable support for private TLDs
PublicSuffix::List.default = PublicSuffix::List.parse(File.read(PublicSuffix::List::DEFAULT_LIST_PATH), private_domains: false)
PublicSuffix.domain("something.blogspot.com")
# => "blogspot.com"

Add domain to list

If you want to manually add a domain to the list just run:

PublicSuffix::List.default << PublicSuffix::Rule.factory('onmicrosoft.com')

What is the Public Suffix List?

The Public Suffix List is a cross-vendor initiative to provide an accurate list of domain name suffixes.

The Public Suffix List is an initiative of the Mozilla Project, but is maintained as a community resource. It is available for use in any software, but was originally created to meet the needs of browser manufacturers.

A "public suffix" is one under which Internet users can directly register names. Some examples of public suffixes are ".com", ".co.uk" and "pvt.k12.wy.us". The Public Suffix List is a list of all known public suffixes.

Why is the Public Suffix List better than any available regular expression parser?

Previously, browsers used an algorithm which basically only denied setting wide-ranging cookies for top-level domains with no dots (e.g. com or org). However, this did not work for top-level domains where only third-level registrations are allowed (e.g. co.uk). In these cases, websites could set a cookie for co.uk which will be passed onto every website registered under co.uk.

Clearly, this was a security risk as it allowed websites other than the one setting the cookie to read it, and therefore potentially extract sensitive information.

Since there is no algorithmic method of finding the highest level at which a domain may be registered for a particular top-level domain (the policies differ with each registry), the only method is to create a list of all top-level domains and the level at which domains can be registered. This is the aim of the effective TLD list.

As well as being used to prevent cookies from being set where they shouldn't be, the list can also potentially be used for other applications where the registry controlled and privately controlled parts of a domain name need to be known, for example when grouping by top-level domains.

Source: https://wiki.mozilla.org/Public_Suffix_List

Not convinced yet? Check out this real world example.

Does `PublicSuffix` make requests to the Public Suffix List website?

No. PublicSuffix comes with a bundled list. It does not make any HTTP requests to parse or validate a domain.

Support

Library documentation is auto-generated from the README and the source code, and it's available at https://rubydoc.info/gems/public_suffix.

The PublicSuffix bug tracker is here: https://github.com/weppos/publicsuffix-ruby/issues
The PublicSuffix code repository is here: https://github.com/weppos/publicsuffix-ruby. Contributions are welcome! Please include tests and/or feature coverage for every patch, and create a topic branch for every separate change you make.

Consider subscribing to Tidelift which provides Enterprise support for this project as part of the Tidelift Subscription. Tidelift subscriptions also help the maintainers by funding the project, which in turn allows us to ship releases, bugfixes, and security updates more often.

Security and Vulnerability Reporting

For full information and a description of our security policy, please see SECURITY.md.

Changelog

See the CHANGELOG.md file for details.

License

The Public Suffix List source is subject to the terms of the Mozilla Public License, v. 2.0.

Definitions

tld = Top-level domain. This refers to the last segment of a domain, the part directly after the final "dot" symbol. For example, in mozilla.org, the .org portion is the tld.

sld = Second level domain, a domain that is directly below a top-level domain. For example, in https://www.mozilla.org/en-US/, mozilla is the second-level domain of the .org tld.

trd = Transit routing domain, also known as a subdomain. This is the part of the domain that comes before the sld or root domain. For example, in https://www.mozilla.org/en-US/, www is the trd.

FQDN = Fully Qualified Domain Name. An FQDN is written with the hostname and the domain name, and includes the top-level domain. The format looks like [hostname].[domain].[tld]., for example [www].[mozilla].[org].

publicsuffix-ruby's Issues

Support native email address parsing

It would be great if PublicSuffix could natively support parsing email addresses, rather than just FQDNs. It would save developers the step of running the email through a regex to strip the local part before passing the domain to PublicSuffix.

As an example, here's how Gman pre-processes emails in such a way that is still fully backwards compatible with passing a domain directly: https://github.com/benbalter/gman/blob/master/lib/gman.rb#L39
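
Until something like this exists in the library, the pre-processing step can stay very small. The sketch below is an assumption about what such a step might look like: `domain_part` is a hypothetical helper name, not a PublicSuffix method, and it uses only core Ruby.

```ruby
# Hypothetical helper (not part of PublicSuffix): returns the part after the
# last "@", or the whole string when there is no "@" at all.
def domain_part(input)
  # rpartition yields ["", "", text] when the separator is missing,
  # so plain domains pass through unchanged.
  input.to_s.strip.rpartition("@").last
end

domain_part("user@example.co.uk") # => "example.co.uk"
domain_part("example.co.uk")      # => "example.co.uk"
```

The result can then be handed to PublicSuffix.parse or PublicSuffix.valid? as usual, which keeps the call backwards compatible with plain domains.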

`parse` method sporadically throws DomainInvalid errors

The parse method seems to randomly choke up when parsing domains. It's given me DomainInvalid errors for domains that when I try simply rerunning the command, work fine. It's happened for .com's, .ca's, .org's and several others. Doesn't seem to discriminate between them.

The validation does not respect RFC952

Hi,

I implemented the gem "public_suffix" to control domains in a form.
But I noted wrong domains are designated as valid by the gem.

Here's an example to illustrate my problem:

PublicSuffix.valid?("goo,gle.com")
=> true

PublicSuffix.valid?("-google.com")
=> true

PublicSuffix.valid?("google-.com")
=> true

As you can read in the RFC952

A "name" (Net, Host, Gateway, or Domain name) is a text string up
to 24 characters drawn from the alphabet (A-Z), digits (0-9), minus
sign (-), and period (.).  Note that periods are only allowed when
they serve to delimit components of "domain style names". (See
RFC-921, "Domain Name System Implementation Schedule", for
background).  No blank or space characters are permitted as part of a
name. No distinction is made between upper and lower case.  The first
character must be an alpha character.  The last character must not be
a minus sign or period.

To fix this temporarily, I added a format validation before calling "public_suffix".

validates :domain, presence: true, format: { with: /\A(?!-)[[:alnum:]\-]+(?<!-)(\.(?!-)[[:alnum:]\-]+(?<!-))+\z/ix }
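
A format check like this can also be written label by label, which tends to be easier to read than one long regular expression. The following is a minimal sketch, not the gem's behavior: `hostname_syntax_ok?` is a hypothetical name, it follows the relaxed rule from RFC 1123 that lets a label start with a digit, and it applies the 63-character label limit from RFC 1035 rather than the 24-character name limit quoted above.

```ruby
# Hypothetical helper: checks hostname syntax only. It does not consult the
# Public Suffix List, so it is meant to run before PublicSuffix.valid?.
def hostname_syntax_ok?(name)
  # A label is letters, digits, and hyphens, with no hyphen at either end.
  label = /\A[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\z/i

  # Drop one trailing dot (FQDN form). The -1 limit keeps empty labels, so
  # "www..example.com" is rejected instead of being silently collapsed.
  labels = name.to_s.chomp(".").split(".", -1)

  labels.size >= 2 && labels.all? { |l| l.length <= 63 && l.match?(label) }
end

hostname_syntax_ok?("google.com")   # => true
hostname_syntax_ok?("goo,gle.com")  # => false
hostname_syntax_ok?("-google.com")  # => false
hostname_syntax_ok?("google-.com")  # => false
```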

Make DEFAULT_DEFINITION_PATH configurable?

Right now, keeping publicsuffix-ruby up to date with the public suffix list in production is not easy. If we can have a configurable path to load definitions from, then we can decouple updating the gem from updating definitions cleanly.

iki.fi

It seems that iki.fi is in the definitions but it returns invalid:

PublicSuffix.valid?('iki.fi')
=> false

DomainNotAllowed Exception for AWS s3

I'm currently receiving a PublicSuffix::DomainNotAllowed exception when trying to parse an S3 URL. It seems like it should be valid and I do see it in the definitions.txt.

Here is an example that causes the exception:

url = 'https://s3.amazonaws.com/content.udacity-data.com/techdocs/UdacityCourseCatalogAPIDocumentation-v0.pdf'
uri = URI.parse(url)
#<URI::HTTPS https://s3.amazonaws.com/content.udacity-data.com/techdocs/UdacityCourseCatalogAPIDocumentation-v0.pdf>
uri.host
"s3.amazonaws.com"
domain = PublicSuffix.parse(uri.host)
*** PublicSuffix::DomainNotAllowed Exception: `s3.amazonaws.com' is not allowed according to Registry policy

nil

When tracing it seems the issue is here in that the parts.join is all of s3.amazonaws.com which makes the first (.*) fail to match anything in rule.rb.

(byebug) domain
"s3.amazonaws.com"
(byebug) parts.join('\.')
"s3\\.amazonaws\\.com"
(byebug) domain.to_s.chomp(".") =~ /^(.*)\.(#{parts.join('\.')})$/
nil
(byebug) [$1, $2]
[nil, nil]

Is it expected that s3 would be rejected? Thanks!

Leading http:// makes it "invalid"

I was quite surprised by this:

irb(main):003:0> PublicSuffix.valid?("http://www.google.com")
=> false

Is there a reason for this, or is it an oversight? I was hoping to use this library to determine whether a string was a valid URL or not, but this makes it give false negatives for that question. Am I missing something?

Fortunately for my purposes, the URI::regex gets those types of strings, while it fails on "google.com" type strings, so between the two I'm able to identify what I'm looking for. But I figured I'd mention this here in case it's an actual bug.

API extension for validation?

Within my app I find myself wanting a method that ensures that a given domain string is both valid and allowed. PublicSuffixService.parse does this but throws exceptions, which are a bit of a pain to rescue, while PublicSuffixService.valid? only checks if a given string is valid but not if it's allowed.

A PublicSuffixService.allowed? method that runs both checks and returns a bool would be most helpful.

request: error handling without exceptions

Exception handling is usually supposed to be for exceptional things instead of basic error handling. If I ask a user for a domain name, it is not exceptional that they will give an invalid input (particularly some kind of URI).
It would be nice to have a parsing interface that will give a nil response if it can't parse.

I have this right now:

tld = begin
  PublicSuffixService.parse(domain_name).tld
rescue PublicSuffixService::DomainInvalid
  nil
end

I would like to write this:

tld = PublicSuffixService.to_domain(domain_name).try(:tld)

Website requirements don't match rakefile/gemspec

Another documentation issue for the website. If this isn't the place for those, let me know please.

On the website (http://simonecarletti.com/code/publicsuffix/) it says that it supports ruby >= 1.8.7. But the Rakefile and .gemspec state that they require >= 2.0. These should ideally be consistent, neh?

For reference, I had just read the website and went ahead and tried it, and it seems to be working for me, and I have

$ ruby --version
ruby 1.9.3p547 (2014-05-14 revision 45962) [x86_64-linux]

But even if it works on the older version, if >= 2.0 is going to be considered what's supported, it would be good to change the website to reflect that.

Add List#[] method for checking whether a rule exists in the list

I would like to query the default list to determine if a domain wildcard rule is actually a public suffix rule.

list['*.wired.com']
# => nil

list['*.com']
# => #<PublicSuffix::Rule::Normal:0x007f9870738bc8 @name="com", @value="com", @type=:normal, @labels=["com"]>

http://www.google.com is invalid

I don't understand why

PublicSuffix.valid?("http://www.google.com")
# => false

"http://somedomain.com" is detected as a valid domain

I don't think this should happen, seeing as it's a URL, not a domain

PublicSuffix.valid?("http://somedomain.com")
 => true

`parse` method fails on numerous valid domains

Running PublicSuffix.parse(...) against a variety of domains raises a PublicSuffix::DomainNotAllowed error. Examples include:

dyndns.biz
homelinux.com
homedns.org
dnsalias.com
walbrzych.pl
withgoogle.com

Problem with www.okuno.abeno.osaka.jp

ruby-head >   test = PublicSuffixService.parse("www.okuno.abeno.osaka.jp")
 => www.okuno.abeno.osaka.jp 
ruby-head > test.tld
 => "abeno.osaka.jp" 
ruby-head > test.domain
 => "okuno.abeno.osaka.jp" 
ruby-head > test.subdomain
 => "www.okuno.abeno.osaka.jp"

shouldn't the subdomain be only "www"?

Amazon EC 2 domain not parsing?

Doing some testing in IRB, I went through some different domains and printed domain.domain for each. Everything worked as expected except Amazon EC2 domains, which returned the full hostname instead of just amazonaws.com.

An example:
$ ruby --version
ruby 2.0.0p247 (2013-06-27 revision 41674) [x86_64-darwin12.3.0]
$ irb
2.0.0p247 :001 > require 'public_suffix'
=> true
2.0.0p247 :002 > domain=PublicSuffix.parse("www.google.com")
=> #<PublicSuffix::Domain:0x007f95f4bd8300 @tld="com", @sld="google", @trd="www">
2.0.0p247 :003 > puts domain.domain
google.com
=> nil
2.0.0p247 :004 > domain=PublicSuffix.parse("ec2-54-251-72-163.ap-southeast-1.compute.amazonaws.com")
=> #<PublicSuffix::Domain:0x007f95f418d0a8 @tld="ap-southeast-1.compute.amazonaws.com", @sld="ec2-54-251-72-163", @trd=nil>
2.0.0p247 :005 > puts domain.domain
ec2-54-251-72-163.ap-southeast-1.compute.amazonaws.com
=> nil
2.0.0p247 :006 >

Here is a list of all the amazon DNS names I tested:
ec2-54-225-231-131.compute-1.amazonaws.com
ec2-50-17-60-214.compute-1.amazonaws.com
ec2-54-229-21-118.eu-west-1.compute.amazonaws.com
ec2-54-244-19-175.us-west-2.compute.amazonaws.com
ec2-54-247-65-162.eu-west-1.compute.amazonaws.com
ec2-50-17-176-153.compute-1.amazonaws.com
ec2-54-214-254-195.us-west-2.compute.amazonaws.com
ec2-204-236-208-101.compute-1.amazonaws.com
ec2-54-224-20-217.compute-1.amazonaws.com
ec2-54-213-44-25.us-west-2.compute.amazonaws.com
ec2-54-245-226-187.us-west-2.compute.amazonaws.com
ec2-50-19-127-165.compute-1.amazonaws.com
ec2-54-249-132-113.ap-northeast-1.compute.amazonaws.com
ec2-54-225-227-229.compute-1.amazonaws.com
ec2-174-129-32-186.compute-1.amazonaws.com
ec2-50-17-57-56.compute-1.amazonaws.com
ec2-176-34-96-161.eu-west-1.compute.amazonaws.com
ec2-54-242-61-103.compute-1.amazonaws.com
ec2-54-245-107-90.us-west-2.compute.amazonaws.com
ec2-107-20-238-127.compute-1.amazonaws.com
ec2-54-227-178-241.compute-1.amazonaws.com
ec2-54-221-202-206.compute-1.amazonaws.com
ec2-54-236-201-7.compute-1.amazonaws.com
ec2-50-19-124-7.compute-1.amazonaws.com
ec2-54-225-108-132.compute-1.amazonaws.com
ec2-174-129-29-28.compute-1.amazonaws.com
ec2-107-22-189-139.compute-1.amazonaws.com
ec2-54-224-13-157.compute-1.amazonaws.com
ec2-54-218-37-122.us-west-2.compute.amazonaws.com
ec2-54-221-199-48.compute-1.amazonaws.com
ec2-54-249-9-114.ap-northeast-1.compute.amazonaws.com
ec2-23-21-75-33.compute-1.amazonaws.com
ec2-54-225-104-230.compute-1.amazonaws.com
ec2-54-243-149-18.compute-1.amazonaws.com
ec2-46-137-228-16.ap-southeast-1.compute.amazonaws.com
ec2-54-234-242-191.compute-1.amazonaws.com
ec2-54-214-128-38.us-west-2.compute.amazonaws.com
ec2-174-129-141-65.compute-1.amazonaws.com
ec2-54-229-126-95.eu-west-1.compute.amazonaws.com
ec2-54-242-54-43.compute-1.amazonaws.com
ec2-79-125-79-81.eu-west-1.compute.amazonaws.com
ec2-54-245-100-30.us-west-2.compute.amazonaws.com
ec2-54-221-195-146.compute-1.amazonaws.com
ec2-54-236-193-203.compute-1.amazonaws.com
ec2-54-213-33-63.us-west-2.compute.amazonaws.com
ec2-54-234-123-94.compute-1.amazonaws.com
ec2-54-243-145-116.compute-1.amazonaws.com
ec2-54-249-121-151.ap-northeast-1.compute.amazonaws.com
ec2-184-72-13-152.us-west-1.compute.amazonaws.com
ec2-54-225-217-11.compute-1.amazonaws.com
ec2-174-129-21-224.compute-1.amazonaws.com
ec2-23-22-162-204.compute-1.amazonaws.com
ec2-176-34-85-199.eu-west-1.compute.amazonaws.com
ec2-54-214-240-75.us-west-2.compute.amazonaws.com
ec2-107-20-111-226.compute-1.amazonaws.com
ec2-54-251-72-163.ap-southeast-1.compute.amazonaws.com
ec2-54-236-190-45.compute-1.amazonaws.com
ec2-23-20-208-34.compute-1.amazonaws.com
ec2-54-225-97-170.compute-1.amazonaws.com

The internet is broken

Using public_suffix to parse incoming emails. The domain 'glow.sch.uk' is coming back as invalid, and unable to parse. This is correct behavior according to the registry, which lists '*.sch.uk'.

Despite this, email addresses do exist on this domain. So... it's broken.

This is a long-winded way to suggest that a soft parse method be available, which doesn't raise errors and tries as best it can to split the domain into the tld, sld, and trd attributes.

Cheers!

New release?

It would be great to have the updated definition.txt in the last rubygems release. Can you make a new release please?

Thanks!

Have you seen this before? Gem::InstallError: public_suffix requires Ruby version >= 2.0

I have installed 2.0, 2.1.0 and 2.2.1 on a local machine and on the koding.com vm and I get this error every time. gem install public_suffix -v '1.5.0' also fails with the same error. I am assuming that there is something that I missed and not the gem.

Do you have any ideas?

$ bundle install
Fetching gem metadata from https://rubygems.org/.........
Fetching version metadata from https://rubygems.org/..
Resolving dependencies......
Using RedCloth 4.2.9
Using i18n 0.7.0
Using json 1.8.2
Using minitest 5.5.1
Using thread_safe 0.3.5
Using tzinfo 1.2.2
Using activesupport 4.2.1
Using addressable 2.3.8
Using blankslate 2.1.2.4
Using hitimes 1.2.2
Using timers 4.0.1
Using celluloid 0.16.0
Using fast-stemmer 1.0.2
Using classifier-reborn 2.0.3
Using coffee-script-source 1.9.1
Using execjs 2.4.0
Using coffee-script 2.3.0
Using colorator 0.1
Using colored 1.2
Using rdoc 4.2.0
Using css_parser 1.2.6
Using mini_portile 0.6.2
Using nokogiri 1.6.6.2
Using deadweight 0.2.2
Using ffi 1.9.8
Using ethon 0.7.3
Using gemoji 2.1.0
Using net-dns 0.8.0

Gem::InstallError: public_suffix requires Ruby version >= 2.0.
An error occurred while installing public_suffix (1.5.0), and Bundler cannot continue.
Make sure that `gem install public_suffix -v '1.5.0'` succeeds before bundling.

$ ruby -v
ruby 2.2.1p85 (2015-02-26 revision 49769) [x86_64-linux]

This is what is in my Gemfile

source 'https://rubygems.org'

gem 'github-pages'
gem 'html-proofer'
gem 'scss_lint_reporter_checkstyle'
gem 'scss-lint'
gem 'mdl'
gem 'deadweight'

PublicSuffixService#parse => NoMethodError on some values

Hi I'm getting NoMethodError when doing this

PublicSuffixService.parse('risk.do')

Digging a little deeper, the 'problem' seems to be that Rule#decompose returns [nil, nil] for that wildcard rule (*.do), and then split is called on nil in line 57 of lib/public_suffix_service.

  def self.parse(domain)
    rule = RuleList.default.find(domain) || raise(InvalidDomain, "`#{domain}' is not a valid domain")

    left, right = rule.decompose(domain)
    parts       = left.split(".")

I'd have sent a patch but I'm not sure what the good behaviour should be here:

Raise invalid domain?
Somehow parse that input.

Enumerable list of valid TLDs?

Is there a way to get the list of the current top-level domains?

 require 'public_suffix'
 tld = PublicSuffix::List.all
 tld.each { |t| puts t }

I'm interested in domains like .es, .us, .info, not co.uk.

Parsing not possible for root-domains like wa.gov.au or sa.gov.au

I am doing the following:
I am trying to parse a host to get the domain or the subdomain
( PublicSuffix.parse(host).domain )

I get an error like " 'hostname' is not allowed according to Registry policy ". But wa.gov.au and sa.gov.au are valid hosts.

But when I am trying to get the domain for "www.sa.gov.au" the result is "www.sa.gov.au" and subdomain is nil.

Cheers,
Dominik

Question about JRuby 1.7.19/Ruby 1.9.3 Support

I see that release 1.5.0 drops support for Ruby 1.9.3, and hence, Jruby 1.7.19, which we use.

I reviewed the v1.4.6..v1.5.0 diff, and I didn't see any obvious breaking change that requires Ruby 2.0.

Is there a known issue with Ruby 1.9.3 and v1.5.0?

Is the 0.8.0 gem broken?

The files in the lib directory in the gem have permission 600 so when I require the gem I can't read them.

GitHub project moved - Links need update

The introduction page GitHub project link in the top-right corner goes to the old project.

It might also be worth pushing out a new version of the gem with the old name to update the links on that page to point to the new pages. As it has more downloads, it tends to show up first in search results.

Simple interface for injecting rules into default set

Is there a simple way of injecting rules into the default set which is maintained across calls to PublicSuffixService? It would be most helpful if there was an interface similar to this:

PublicSuffixService.valid?('foo.com') => true
PublicSuffixService.valid?('foo.test') => false
PublicSuffixService.add_rule('*.test')
PublicSuffixService.valid?('foo.test') => true

www..example.com is valid

PublicSuffix.parse("www..example.com")

# No errors, it thinks this is a valid domain

The expected behavior is probably to raise an error, because this domain is invalid (it has two dots between www and example.com).

is not allowed according to Registry policy

Something going wrong

2.0.0 (main):0 > PublicSuffix.parse('http://volgograd.ru')
PublicSuffix::DomainInvalid: `http://volgograd.ru' is not a valid domain
from /Users/bugagazavr/.rvm/gems/ruby-2.0.0-p247/gems/public_suffix-1.3.1/lib/public_suffix.rb:68:in `parse'
2.0.0 (main):0 > PublicSuffix.parse('volgograd.ru')
PublicSuffix::DomainNotAllowed: `volgograd.ru' is not allowed according to Registry policy

but google works fine:

2.0.0 (main):0 > PublicSuffix.parse('google.ru')
=> #<PublicSuffix::Domain:0x007fcf455a58e0 @sld="google", @tld="ru", @trd=nil>

Allow full URLs

If I'm trying to parse a full URL, it seems strange that I need to use both PublicSuffix and URI. e.g.:

domain = PublicSuffix.parse(URI.parse('http://www.example.com/foo/bar?x=y').host).domain

It seems like it would be relatively easy for this gem's #parse method to automatically do a URI.parse(...).host to fetch the host behind the scenes, falling back on the default implementation if URI can't parse the string.
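
A caller-side version of this fallback is short enough to keep in application code. The sketch below is one possible shape, assuming the input is either a URL with a scheme or a bare host name. `host_for_lookup` is a hypothetical helper, not part of the gem.

```ruby
require "uri"

# Hypothetical helper: returns the host of a URL, or the input itself when
# it has no URL host or cannot be parsed as a URI.
def host_for_lookup(input)
  text = input.to_s.strip
  # A bare name such as "google.com" parses as a relative path, so its
  # host is nil and the original text is returned instead.
  URI.parse(text).host || text
rescue URI::InvalidURIError
  text
end

host_for_lookup("http://www.example.com/foo/bar?x=y") # => "www.example.com"
host_for_lookup("google.com")                         # => "google.com"
```

Strings without a scheme, such as "www.example.com/foo", are returned unchanged, so they still need separate handling before being passed to PublicSuffix.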

Expected behavior for rule exceptions

For Gman I use a custom public suffix-formatted list, and would like to add an exception. I'm not sure if the current rule exception behavior works as intended, or if I'm just not understanding it properly:

On the public suffix list, I see an entry !city.kobe.jp, but can't seem to get any indication of the exception:

[14] pry(main)> PublicSuffix.valid? "city.kobe.jp"
=> true
[15] pry(main)> PublicSuffix.valid? "foo.city.kobe.jp"
=> true
[16] pry(main)> PublicSuffix.valid? "foo.bar.city.kobe.jp"
=> true

In my own implementation, when I call find on an excepted domain, it returns the Exception as expected, but rule.allow? still passes the domain (the check that valid? uses).

Am I missing something?

Example for DomainInvalid isn't

I'm not sure if this is the place for issues relating to the documentation on the site, but I noticed this while running through the examples on http://simonecarletti.com/code/publicsuffix/:

For raising PublicSuffix::DomainInvalid it uses "example.xxx" and "www.example.xxx", but these do not raise that error, as the .xxx domain now exists. If the examples are changed to "example.yyy" and "www.example.yyy", then they will correctly raise that error.

License missing from gemspec

RubyGems.org doesn't report a license for your gem. This is because it is not specified in the gemspec of your last release.

via e.g.

  spec.license = 'MIT'
  # or
  spec.licenses = ['MIT', 'GPL-2']

Including a license in your gemspec is an easy way for rubygems.org and other tools to check how your gem is licensed. As you can imagine, scanning your repository for a LICENSE file or parsing the README, and then attempting to identify the license or licenses is much more difficult and more error prone. So, even for projects that already specify a license, including a license in your gemspec is a good practice. See, for example, how rubygems.org uses the gemspec to display the rails gem license.

There is even a License Finder gem to help companies/individuals ensure all gems they use meet their licensing needs. This tool depends on license information being available in the gemspec. This is an important enough issue that even Bundler now generates gems with a default 'MIT' license.

I hope you'll consider specifying a license in your gemspec. If not, please just close the issue with a nice message. In either case, I'll follow up. Thanks for your time!

Appendix:

If you need help choosing a license (sorry, I haven't checked your readme or looked for a license file), GitHub has created a license picker tool. Code without a license specified defaults to 'All rights reserved'-- denying others all rights to use of the code.
Here's a list of the license names I've found and their frequencies

p.s. In case you're wondering how I found you and why I made this issue, it's because I'm collecting stats on gems (I was originally looking for download data) and decided to collect license metadata, too, and make issues for gemspecs not specifying a license as a public service :). See the previous link or my blog post about this project for more information.

Uppercase domain is reported as not valid

This fails:

PublicSuffix.parse('INSTAGRAM.COM')
PublicSuffix::DomainInvalid: `INSTAGRAM.COM' is not a valid domain

But this works:

PublicSuffix.parse('INSTAGRAM.COM'.downcase)
#<PublicSuffix::Domain:0x5c56a5e5 @sld="instagram", @tld="com", @trd=nil>

This seems unintuitive. If domain names need to be downcased before passing them into PublicSuffix, this should at least be more clearly documented in the README.
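
For versions where this applies, the normalization can be done once at the boundary of the application, since DNS names are case-insensitive. A minimal sketch, with `normalize_name` as a hypothetical helper name:

```ruby
# Hypothetical helper: trims surrounding whitespace and lowercases the name
# before it is handed to PublicSuffix.
def normalize_name(name)
  name.to_s.strip.downcase
end

normalize_name(" INSTAGRAM.COM ") # => "instagram.com"
```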

Russian-Cyrillic domains are invalid

All Russian Cyrillic domains are invalid, e.g. http://xn--90aiwacf1a.xn--p1ai/

www.gov.uk returns www for sld

https://www.gov.uk is a valid website.

>> domain = PublicSuffix.parse("www.gov.uk")
www.gov.uk
>> domain.sld
"www"

I understand why it returns www - but it looks a bit silly in my application.
Ideally I'd like it to return gov - is there anything I can do about it other than manually add an exception for that domain?

csiro.au reported as invalid

csiro.au looks like a legit site and tld.

About csiro.
http://csiro.au/en/Portals/About-CSIRO.aspx

Sites we found using it.
http://es.csiro.au/
http://www.clw.csiro.au/

I'm getting DomainInvalid error

I got this error while trying to parse the url.

PublicSuffix::DomainInvalid at /create
`http://www.youtube.com/watch?v=WsEkFpWQZ' is not a valid domain

Problem with blogspot.com domains

This doesn't seem right :(
(Running 1.9.2)

irb(main):012:0> domain = PublicSuffixService.parse("narf.blogspot.com")
=> narf.blogspot.com
irb(main):013:0> domain.tld
=> "blogspot.com"
irb(main):014:0> domain.sld
=> "narf"
irb(main):015:0> domain.trd
=> nil
irb(main):016:0> domain.domain
=> "narf.blogspot.com"
irb(main):017:0> domain.subdomain
=> nil

Any suggestions to speed up list parsing?

I use PublicSuffix as part of Gman.

In addition to using PublicSuffix's native valid? check, Gman also contains its own public-suffix formatted list of government domains which it then checks using PublicSuffix's rule logic. The relevant code is:

    # check using public suffix's standard logic
    rule = Gman.list.find domain

    # domain is on the domain list and
    # domain is not explicitly blacklisted and
    # domain matches a standard public suffix list rule
    !rule.nil? && rule.type != :exception && rule.allow?(".#{domain}")

Profiling against 10,000 random (valid) domains, which took about 250 seconds, here's the breakdown:

 %self      total      self      wait     child     calls  name
 23.59     78.365    63.150     0.000    15.215 32708764   PublicSuffix::Rule::Base#odiff
 18.37    232.715    49.154     0.000   183.560 32708764   PublicSuffix::Rule::Base#match?
 16.08    105.479    43.029     0.000    62.451 32777587   <Class::PublicSuffix::Domain>#domain_to_labels
 14.47     38.736    38.736     0.000     0.000 32827949   String#split
  8.26    254.831    22.116     0.000   232.715    49998   Array#select
  5.68     15.215    15.215     0.000     0.000 32708764   Array#[]
  5.10     13.648    13.648     0.000     0.000 32787586   Array#reverse
  3.84     10.270    10.270     0.000     0.000 33055272   String#to_s
  0.79      2.507     2.106     0.000     0.400    79975   PublicSuffix::Rule::Normal#decompose
  0.40      1.066     1.066     0.000     0.000    49998   Array#values_at
  0.18      3.915     0.477     0.000     3.438    30001   <Class::Addressable::URI>#parse
  0.15    256.623     0.414     0.000   256.209    49998   PublicSuffix::List#select

It appears the bottleneck is in PublicSuffix's matching. I see #2, but do you have any suggestions how to speed up PublicSuffix's parsing, both for its own native list, and for Gman's vendor list?
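
The profile shows Rule::Base#match? being called about 32.7 million times for roughly 50,000 List#select calls, which suggests that each lookup scans the whole list. One common way to avoid that, sketched here as an assumption rather than as what PublicSuffix does, is to index the rules by name once and then probe only the suffixes of the domain being checked. `build_index` and `candidate_rules` are hypothetical names, and the rule strings are a tiny sample.

```ruby
# Hypothetical index: maps a rule name (without the "!" or "*." prefix)
# to the rule strings that share that name. Built once per list.
def build_index(rule_lines)
  index = Hash.new { |hash, key| hash[key] = [] }
  rule_lines.each do |line|
    key = line.delete_prefix("!").delete_prefix("*.")
    index[key] << line
  end
  index
end

# Returns the rules whose name is a suffix of the domain, using one hash
# lookup per label instead of a scan over every rule.
def candidate_rules(index, domain)
  labels = domain.downcase.split(".")
  suffixes = (0...labels.size).map { |i| labels[i..-1].join(".") }
  suffixes.flat_map { |suffix| index.fetch(suffix, []) }
end

index = build_index(["com", "uk", "co.uk", "jp", "*.kobe.jp", "!city.kobe.jp"])
candidate_rules(index, "www.google.co.uk") # => ["co.uk", "uk"]
```

A domain with n labels needs n hash lookups regardless of the list size. Choosing the winning rule among the candidates (exceptions first, then the longest match, with wildcards needing one extra label) is left out of the sketch.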

Support a way of adding a tld at runtime?

I don't know if you think it is a good idea, but what do you think about the ability to temporarily extend PublicSuffix with a new tld?

For example - lets say locally I have a domain called "foo.dev" which is not a valid TLD, but yet if I can say:

PublicSuffix.add_tld = "dev"
domain = PublicSuffix.parse("foo.dev")
# .. no errors here

undefined method `include?' for nil:NilClass in valid? call

I can't really tell which domain caused this, but I just fished this out of my logs:

undefined method `include?' for nil:NilClass
"/usr/local/rvm/gems/ruby-1.9.3-p194/gems/public_suffix-1.1.1/lib/public_suffix/list.rb:199:in `select'"
"/usr/local/rvm/gems/ruby-1.9.3-p194/gems/public_suffix-1.1.1/lib/public_suffix/list.rb:184:in `find'"
"/usr/local/rvm/gems/ruby-1.9.3-p194/gems/public_suffix-1.1.1/lib/public_suffix.rb:127:in `valid?'

Fail parsing domain 'te.ua'

PublicSuffix.parse('te.ua') raises an exception with the error '`te.ua' is not allowed according to Registry policy'.
It seems that this cannot be fixed, because te.ua is itself a domain zone.

Handling IDN/Punycode

Hi,

It would be nice to add IDN/Punycode encoding to PublicSuffix.

I'm currently using simpleidn to do the job but it would make sense to have everything inside a single class.

I'll try to send a pull request when I have some time for this, but maybe you already have plans on this subject.

Thanks anyway for this great gem!

Trailing period causes DomainNotAllowed exception

With public_suffix_service 0.7.0 parsing a standard domain works well:

ruby-1.8.7-p249 > PublicSuffixService.parse('example.com')
 => #<PublicSuffixService::Domain:0x101905cc0 @tld="com", @trd=nil, @sld="example">

Adding trailing punctuation correctly returns a DomainInvalid error:

ruby-1.8.7-p249 > PublicSuffixService.parse('example.com,')
PublicSuffixService::DomainInvalid: `example.com,' is not a valid domain
ruby-1.8.7-p249 > PublicSuffixService.parse('example.com:')
PublicSuffixService::DomainInvalid: `example.com:' is not a valid domain

But if the trailing punctuation is a period, the error returned is DomainNotAllowed instead:

ruby-1.8.7-p249 > PublicSuffixService.parse('example.com.')
PublicSuffixService::DomainNotAllowed: `example.com.' is not allowed according to Registry policy
ruby-1.8.7-p249 > PublicSuffixService.parse('*.example.com.')
PublicSuffixService::DomainNotAllowed: `*.example.com.' is not allowed according to Registry policy

It's sometimes useful to handle miskeyed input data (DomainInvalid) differently from domains that shouldn't exist (DomainNotAllowed). For example, in an application I'm working on we ignore DomainInvalid because some of the hostnames are on private networks (e.g. host.bigcompany). Without extra error-checking code, our application will fail to handle a common data input mistake.

Also, example.com. is actually a valid hostname - the trailing . implies a fully qualified domain name.
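
For a version that rejects the trailing dot, one caller-side workaround is to remove a single root dot before parsing. The sketch below is only an illustration: strip_root_dot is a hypothetical helper, not part of the gem, and it deliberately leaves other trailing punctuation alone so that miskeyed input still reaches the parser unchanged.

```ruby
# Hypothetical helper: treat one trailing dot as the FQDN root marker and
# remove it. Names ending in ".." or in other punctuation are left untouched,
# so the parser can still report them as invalid.
def strip_root_dot(name)
  return name unless name.end_with?(".") && !name.end_with?("..")

  name.chomp(".")
end

puts strip_root_dot("example.com.")
puts strip_root_dot("example.com,")
```

The result of strip_root_dot can then be handed to the parser in place of the raw input.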

' is not a valid domain

I get the following error:
/home/dmaynor/.rvm/gems/ruby-1.9.3-p194/gems/public_suffix-1.1.1/lib/public_suffix.rb:68:in `parse': `adsl-75-17-113-25.dsl.pltn13.sbcglobal.net (PublicSuffix::DomainInvalid)
' is not a valid domain
from ./parse_host.rb:7:in `block in <main>'
from ./parse_host.rb:6:in `each_line'
from ./parse_host.rb:6:in `

From the input:
adsl-75-17-113-25.dsl.pltn13.sbcglobal.net

I have a list of a million hostnames, and it does this for every one of them. The test code is really simple:

File.open("scan_host_names").each_line { |hname|
  domain = PublicSuffix.parse(hname)
  puts domain.tld
}

I also tried it in irb and got a similar error.
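
In the error message the closing quote appears on its own line, which suggests that the string passed to the parser still ends with a newline: each_line yields every line with its line terminator attached. A minimal sketch of the likely fix is to strip each line before parsing it. Here a StringIO stands in for the scan_host_names file, which is an assumption made only to keep the example self-contained.

```ruby
require "stringio"

# Stand-in for File.open("scan_host_names"); the host name is the one from
# the report above.
input = StringIO.new("adsl-75-17-113-25.dsl.pltn13.sbcglobal.net\n")

# each_line keeps the trailing "\n", which is not a valid hostname character.
raw = input.each_line.to_a
p raw.first

# Strip surrounding whitespace and skip blank lines before parsing.
input.rewind
clean = input.each_line.map(&:strip).reject(&:empty?)
p clean.first
# Each entry in clean can now be passed to PublicSuffix.parse.
```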

"www. .com" is valid (with spaces)

Why is a URL like "www. .com" valid?
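
If stricter checking is needed, one option is to check the syntax of each label before consulting the suffix list. The sketch below is an assumption about what such a pre-check could look like, not the library's own validation: HOSTNAME_LABEL and plausible_hostname? are hypothetical names, and the check accepts only ASCII letters, digits and hyphens, so names containing underscores or non-ASCII characters are rejected.

```ruby
# Letters, digits and hyphens only; no leading or trailing hyphen;
# between 1 and 63 characters per label.
HOSTNAME_LABEL = /\A[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\z/i

# Hypothetical syntax pre-check, applied before any suffix list lookup.
def plausible_hostname?(name)
  # The -1 limit keeps empty labels, so "a..b" is rejected as well.
  labels = name.chomp(".").split(".", -1)
  return false if labels.empty? || name.length > 253

  labels.all? { |label| HOSTNAME_LABEL.match?(label) }
end

p plausible_hostname?("www.google.com")
p plausible_hostname?("www. .com")
```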

Include preceding comment in rule definition

Would be awesome if the rule was aware of its grouping in the list, such that, if a domain matched a rule, I could tell which group it falls within.

Theoretically, rules should correspond to the first line of the immediately preceding comment block, but obviously, it's just a convention, not a standard.
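
To make the idea concrete, here is a rough sketch of a parser that records, for each rule, the first line of the comment block that precedes it. It assumes the convention described above (comment lines start with "//" and blank lines separate blocks). parse_with_groups is a hypothetical helper, and the sample data is made up for illustration.

```ruby
# Hypothetical sketch: map each rule to the first line of the comment block
# that precedes it in a Public Suffix List formatted text.
def parse_with_groups(text)
  groups = {}
  current_group = nil
  in_comment_block = false

  text.each_line do |line|
    line = line.strip
    if line.empty?
      in_comment_block = false
    elsif line.start_with?("//")
      # Only the first line of a comment block names the group.
      current_group = line.sub(%r{\A//\s*}, "") unless in_comment_block
      in_comment_block = true
    else
      in_comment_block = false
      # A rule is the text up to the first whitespace on the line.
      groups[line.split(/\s+/).first] = current_group
    end
  end

  groups
end

# Illustrative data only, not taken from a real list.
sample = <<~LIST
  // Example Federal Agencies
  // (second comment line, ignored for grouping)
  agency.example
  bureau.example

  // Example State Agencies
  state.example
LIST

p parse_with_groups(sample)
```

Since the grouping is only a convention, a rule that follows a multi-purpose or missing comment will end up with a misleading or nil group.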

'googlecode.com' is invalid (but 'google.com' is...)

Raises the following exception:
irb(main):033:0> PublicSuffix.parse('googlecode.com').domain
PublicSuffix::DomainNotAllowed: `googlecode.com' is not allowed according to Registry policy

Also, when parsing 'www.googlecode.com' it returns 'googlecode.com' as the TLD and 'www.googlecode.com' as the domain (where it obviously should have been 'com' and 'googlecode.com').

It works correctly on the other domains I've tried; it's only this one.

'com' domain parsing

Is it normal that the 'com' domain is valid but not parseable?
I think all valid domains should be parseable, so the 'com' domain should be invalid. No?