remusao / tldts

JavaScript Library to extract domains, subdomains and public suffixes from complex URIs.

Home Page: https://npmjs.com/tldts

License: MIT License

JavaScript 4.30% TypeScript 95.55% Shell 0.11% Makefile 0.04%
typescript tld public-suffix-list url-parsing domain uri url javascript

tldts's Introduction

tldts - Blazing Fast URL Parsing

tldts is a JavaScript library to extract hostnames, domains, public suffixes, top-level domains and subdomains from URLs.

Features:

  1. Tuned for performance (order of 0.1 to 1 μs per input)
  2. Handles both URLs and hostnames
  3. Full Unicode/IDNA support
  4. Support parsing email addresses
  5. Detect IPv4 and IPv6 addresses
  6. Continuously updated version of the public suffix list
  7. TypeScript, ships with umd, esm, cjs bundles and type definitions
  8. Small bundles and small memory footprint
  9. Battle tested: full test coverage and production use

Install

npm install --save tldts

Usage

Using the command-line interface:

$ npx tldts 'http://www.writethedocs.org/conf/eu/2017/'
{
  "domain": "writethedocs.org",
  "domainWithoutSuffix": "writethedocs",
  "hostname": "www.writethedocs.org",
  "isIcann": true,
  "isIp": false,
  "isPrivate": false,
  "publicSuffix": "org",
  "subdomain": "www"
}

Or from the command-line in batch:

$ printf 'http://www.writethedocs.org/\nhttps://example.com\n' | npx tldts
{
  "domain": "writethedocs.org",
  "domainWithoutSuffix": "writethedocs",
  "hostname": "www.writethedocs.org",
  "isIcann": true,
  "isIp": false,
  "isPrivate": false,
  "publicSuffix": "org",
  "subdomain": "www"
}
{
  "domain": "example.com",
  "domainWithoutSuffix": "example",
  "hostname": "example.com",
  "isIcann": true,
  "isIp": false,
  "isPrivate": false,
  "publicSuffix": "com",
  "subdomain": ""
}

Programmatically:

const { parse } = require('tldts');

// Retrieve hostname-related information for a given URL
parse('http://www.writethedocs.org/conf/eu/2017/');
// { domain: 'writethedocs.org',
//   domainWithoutSuffix: 'writethedocs',
//   hostname: 'www.writethedocs.org',
//   isIcann: true,
//   isIp: false,
//   isPrivate: false,
//   publicSuffix: 'org',
//   subdomain: 'www' }

ES module imports are also supported:

import { parse } from 'tldts';

Alternatively, you can try it directly in your browser here: https://npm.runkit.com/tldts

Check README.md for more details about the API.

Contributors

tldts is based upon the excellent tld.js library and would not exist without the many contributors who worked on the project.

This project would not be possible without Mozilla's amazing Public Suffix List either. Thank you for your hard work!

License

MIT License.

tldts's People

Contributors

chrmod, dependabot-preview[bot], dependabot-support, dependabot[bot], fulldecent, ghostwords, greenkeeper[bot], jdesboeufs, jhnns, kellycampbell, kikobeats, krinkle, olivoil, remusao, thom4parisot, xdamman, yehezkielbs

tldts's Issues

Not correctly parsing public suffix for blogspot.com

The tldjs package correctly parsed "subdomain.blogspot.com" as having a public suffix of "blogspot.com".

This tldts package seems to not recognize blogspot.com as a public suffix, as noted in the Public Suffix List.

Steps to reproduce:

const tldjs = require("tldjs");
const tldts = require("tldts");
const input = "subdomain.blogspot.com";
if (tldjs.getPublicSuffix(input) !== tldts.getPublicSuffix(input)) {
  throw new Error(`${input} did not match in both libraries`);
}
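
One possible explanation, offered as an assumption rather than a confirmed diagnosis: blogspot.com belongs to the private section of the Public Suffix List, and a lookup that only consults ICANN rules stops at com. Another report on this page passes an allowPrivateDomains option to parse(), which suggests private rules are opt-in. The toy matcher below is not tldts's implementation; the RULES table and the publicSuffix helper are hypothetical and only illustrate the effect of such a flag.

```javascript
// Toy rule table: each rule is tagged with the PSL section it comes from.
const RULES = new Map([
  ['com', 'icann'],
  ['blogspot.com', 'private'],
]);

// Return the longest rule matching the end of `hostname`.
// Private rules are only considered when `allowPrivateDomains` is true.
function publicSuffix(hostname, { allowPrivateDomains = false } = {}) {
  const labels = hostname.toLowerCase().split('.');
  for (let i = 0; i < labels.length; i += 1) {
    const candidate = labels.slice(i).join('.');
    const section = RULES.get(candidate);
    if (section === 'icann' || (section === 'private' && allowPrivateDomains)) {
      return candidate; // candidates shrink as i grows, so the first hit is the longest
    }
  }
  return null;
}

console.log(publicSuffix('subdomain.blogspot.com')); // com
console.log(publicSuffix('subdomain.blogspot.com', { allowPrivateDomains: true })); // blogspot.com
```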

Error: ENOENT: no such file or directory

Hey, I get the following error, and more of the same type, when trying to call the parse() method:

WARNING in ./node_modules/tldts-core/dist/es6/src/is-valid.js Module Warning (from ./node_modules/source-map-loader/dist/cjs.js): Failed to parse source map from '/Users/sepe/IdeaProjects/my-project/node_modules/tldts-core/src/is-valid.ts' file: Error: ENOENT: no such file or directory, open '/Users/sepe/IdeaProjects/my-project/node_modules/tldts-core/src/is-valid.ts'

It also fails to find suffix-trie.ts, trie.ts, index.ts, subdomain.ts, and others.
I am using version 5.7.67, but it also occurs with older versions. We have a React 17 project with Node 16 and TypeScript 4.5.5. In node_modules, the packages tldts and tldts-core do exist. Do you have an idea what the problem could be in this case?
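
The message above is a warning emitted by source-map-loader, and it appears to concern source maps that point at .ts files not shipped in the package, rather than the parse() call itself. If the bundler is webpack 5 and its configuration is editable (both are assumptions; the report does not say), one common workaround is to silence that class of warning. A sketch of such a fragment:

```javascript
// webpack.config.js (fragment): assumes webpack 5, which supports `ignoreWarnings`.
const config = {
  // Drop warnings whose message mentions unparseable source maps.
  ignoreWarnings: [/Failed to parse source map/],
};

module.exports = config;
```

This hides the symptom only; the underlying mismatch between the source maps and the published files would still need a fix in the package.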

API additions

Consider the addition of the following in the API.

Methods:

  • isPublicSuffix
  • hasKnownSuffix
  • isPrivate
  • isIcann

Attributes:

  • tld
  • sld
  • trd

Optionally we could also return:

  • scheme
  • credentials
  • port
  • resource_path
  • query_string
  • fragment
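
For reference, the optional components listed above are already exposed by the WHATWG URL class that ships with Node.js and browsers. The sketch below maps them onto the names used in this proposal; the mapping and the sample URL are illustrative assumptions, not a planned tldts API.

```javascript
// The WHATWG URL class is a global in Node.js and browsers.
const url = new URL('https://user:secret@www.example.com:8080/a/b?x=1#top');

const components = {
  scheme: url.protocol.replace(/:$/, ''), // protocol includes a trailing colon
  credentials: { username: url.username, password: url.password },
  port: url.port, // empty string when the scheme's default port is used
  resource_path: url.pathname,
  query_string: url.search, // includes the leading '?'
  fragment: url.hash, // includes the leading '#'
};

console.log(components);
```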

Using as es6 module in chrome extension

Hello, I am trying to use this as an es6 module like so import { parse } from "tldts";
The browser extension is not a node.js project (but we are using npm just so we can use eslint) so I'm trying to understand how to import it.

Currently receiving this error

Uncaught TypeError: Failed to resolve module specifier "tldts". Relative references must start with either "/", "./", or "../".

Which makes perfect sense, except I don't know what path the module specifier should use.

Confusion over public suffixes

I'm a little confused about how the library matches public suffixes. I can see that pages.dev is in the Public Suffix List, yet for it the publicSuffix returned is only dev, whereas for co.uk it is co.uk. Is there something I'm misunderstanding here?

Different results for seemingly same hostname structures.

Hi! I've run into a problem parsing https://www.city.toyota.aichi.jp/:

tldts.parse('http://www.my.complex.domain.jp') returns as expected

domain: "domain.jp"
domainWithoutSuffix: "domain"
hostname: "www.my.complex.domain.jp"
isIcann: true
isIp: false
isPrivate: false
publicSuffix: "jp"
subdomain: "www.my.complex"

where tldts.parse('http://www.city.toyota.aichi.jp') for some reason returns

domain: "city.toyota.aichi.jp"
domainWithoutSuffix: "city"
hostname: "www.city.toyota.aichi.jp"
isIcann: true
isIp: false
isPrivate: false
publicSuffix: "toyota.aichi.jp"
subdomain: "www"

with publicSuffix and subdomain being messed up.

Am I doing something wrong, or is it a bug?

P.S. I've tested this in a project and at https://npm.runkit.com/tldts with the same results.
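
The two results are consistent with longest-rule matching, assuming the Public Suffix List carries a rule for toyota.aichi.jp (worth checking against the current list). A minimal sketch with a hypothetical three-rule table, not tldts's actual data structure:

```javascript
// Hypothetical excerpt of suffix rules; the real list is far larger.
const RULES = new Set(['jp', 'aichi.jp', 'toyota.aichi.jp']);

// Split a hostname into subdomain / domain / publicSuffix using the longest matching rule.
function split(hostname) {
  const labels = hostname.split('.');
  for (let i = 0; i < labels.length; i += 1) {
    const suffix = labels.slice(i).join('.');
    if (RULES.has(suffix)) {
      // The registrable domain is the suffix plus one more label to its left.
      const domain = i > 0 ? labels.slice(i - 1).join('.') : null;
      const subdomain = i > 1 ? labels.slice(0, i - 1).join('.') : '';
      return { publicSuffix: suffix, domain, subdomain };
    }
  }
  return { publicSuffix: null, domain: null, subdomain: null };
}

console.log(split('www.city.toyota.aichi.jp'));
// { publicSuffix: 'toyota.aichi.jp', domain: 'city.toyota.aichi.jp', subdomain: 'www' }
console.log(split('www.my.complex.domain.jp'));
// { publicSuffix: 'jp', domain: 'domain.jp', subdomain: 'www.my.complex' }
```

Under that assumption, the hostnames only look structurally identical: one ends in a multi-label rule and the other does not.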

Incorrect domain returned when parsing `http://sub.domain.global.prod.fastly.net`

Expected to get domain: domain.global.prod.fastly.net,
since global.prod.fastly.net is present in https://publicsuffix.org/list/public_suffix_list.dat.
Instead got (full result):

{
    "domain": "fastly.net",
    "domainWithoutSuffix": "fastly",
    "hostname": "sub.domain.global.prod.fastly.net",
    "isIcann": true,
    "isIp": false,
    "isPrivate": false,
    "publicSuffix": "net",
    "subdomain": "sub.domain.global.prod"
}

P.S. Thank you for this nice library!

Underscore domain parsing

Parsing an underscore domain appears to give the wrong result, e.g.:

import { parse } from 'tldts';

const tldResult = parse(`https://_.rocks/iqadcontroller.js`);

// tldResult.domain is null
// tldResult.hostname is null

Dependabot can't resolve your JavaScript dependency files

Dependabot can't resolve your JavaScript dependency files.

As a result, Dependabot couldn't update your dependencies.

The error Dependabot encountered was:

Error whilst updating auto in /yarn.lock:
Couldn't find package "@auto-it/[email protected]" required by "@auto-it/[email protected]" on the "npm" registry.

If you think the above is an error on Dependabot's side please don't hesitate to get in touch - we'll do whatever we can to fix it.

View the update logs.

passing validSuffix not working as expected

Hello again!

I'm trying to add a few other domains to be treated as valid suffixes, using the validHosts API option.

Am I using this incorrectly, or is it possible to add my own list of public suffixes? If not, I have a workaround I can use outside of this library 😃

const {parse} = require('tldts');
const parsed = parse("subdomain.wordpress.com", {
   allowPrivateDomains: true,
   validHosts: ['wordpress.com']
});
expect(parsed.domain).to.equal("sub.wordpress.com");
// domain actually equals "wordpress.com"

`getIp` / `isIp` function

I've noticed that the library exports several helpful functions, such as getDomain and getHostname. Following this pattern, I would suggest adding getIp or exporting the isIp function.
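
Until such a function is exported, the isIp field of the parse() result shown in the README already answers the yes/no question. In Node.js, a small wrapper along the following lines could fill the gap. getIp is a hypothetical name, and the check relies on Node's net.isIP rather than on tldts's own detection.

```javascript
const net = require('net');

// Hypothetical helper, not part of tldts: return the IP address if `hostname` is one, else null.
function getIp(hostname) {
  // URLs wrap IPv6 literals in brackets, e.g. [::1]; strip them before testing.
  const bare = hostname.startsWith('[') && hostname.endsWith(']')
    ? hostname.slice(1, -1)
    : hostname;
  // net.isIP returns 4 or 6 for valid addresses and 0 otherwise.
  return net.isIP(bare) !== 0 ? bare : null;
}

console.log(getIp('192.168.0.1')); // 192.168.0.1
console.log(getIp('[::1]')); // ::1
console.log(getIp('example.com')); // null
```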

parse errors for some specific URLs

parse('https://www.constructor.dk')

throws an error:

Cannot read properties of undefined (reading 'www')

Same issue for https://www.constructor.fr and some others.

P.S. Thank you for this nice and efficient library!

Wildcards can occur for any segment, not just the start, and multiply

The PSL's spec says:

Wildcards are not restricted to appear only in the leftmost position, but they must wildcard an entire label. (I.e. *.*.foo is a valid rule: *bar.foo is not.)

...but the parsing code only looks for a single wildcard in the leftmost position:

} else if (line.startsWith('*.')) {

There aren't actually any present rules that have a non-leftmost wildcard, so this is a future-proofing concern.
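
To make the concern concrete, here is a minimal sketch of rule matching in which a wildcard may occupy any label, and more than one label, of a rule. It is not the library's parser: matchRule is a hypothetical helper, exception rules are ignored, and the rules in the examples are made up for illustration.

```javascript
// Does `hostname` end with a suffix matching `rule`, where '*' stands for exactly one label?
// Returns the matched suffix, or null. Exception rules ('!') are out of scope here.
function matchRule(rule, hostname) {
  const ruleLabels = rule.split('.').reverse();
  const hostLabels = hostname.split('.').reverse();
  if (hostLabels.length < ruleLabels.length) return null;

  // Compare label by label, starting from the rightmost one.
  for (let i = 0; i < ruleLabels.length; i += 1) {
    if (ruleLabels[i] !== '*' && ruleLabels[i] !== hostLabels[i]) return null;
  }
  return hostLabels.slice(0, ruleLabels.length).reverse().join('.');
}

console.log(matchRule('*.ck', 'www.example.ck')); // example.ck
console.log(matchRule('*.*.foo', 'a.b.c.foo')); // b.c.foo
console.log(matchRule('bar.*.foo', 'x.bar.y.foo')); // bar.y.foo
console.log(matchRule('*.*.foo', 'c.foo')); // null
```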

Fails to parse hostnames with leading `.`

Domains with leading . are not parsed by this library. e.g.

tldts.getDomain('.example.com') === null

It is not clear to me if a leading . is invalid in a hostname from the available specs. The URL spec is very loose on the definition of a valid hostname, and the implementation in browsers accepts such a hostname:

new URL('https://.example.com').hostname === '.example.com'

Additionally, the leading dot notation is commonly used for cookies which span all subdomains of a given domain. This kind of notation is acknowledged as possible in the domain part of a cookie string (though apparently ignored in modern implementations):

"Contrary to earlier specifications, leading dots in domain names (.example.com) are ignored."
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie
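
If cookie-style domains need to parse today, one workaround is to normalize the input before handing it to the library. The helper below is a hypothetical pre-processing step; whether the library should perform it internally is the open question of this report.

```javascript
// Hypothetical pre-processing step: drop leading dots (cookie-style ".example.com")
// so the remainder can be handed to a hostname parser.
function stripLeadingDots(hostname) {
  return hostname.replace(/^\.+/, '');
}

console.log(stripLeadingDots('.example.com')); // example.com
console.log(stripLeadingDots('example.com')); // example.com
```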

Create CLI

Create a simple CLI that can parse a URL or a list of URLs.

tslib hidden dependency

I tried to use the metascraper-clearbit package, which depends on tldts. However, it failed with MODULE_NOT_FOUND when I tried to run it, because tslib wasn't present. After running yarn install tslib, it works.

Is tslib a runtime requirement of this library, even if you're not using TypeScript? Not a big deal for me because I do, but thought you should be aware.

Full error:

internal/modules/cjs/loader.js:626
    throw err;
    ^

Error: Cannot find module 'tslib'
Require stack:
- /Users/omardiab/code/open-source/metascraper-demo/node_modules/tldts/build/cjs/index.js
- /Users/omardiab/code/open-source/metascraper-demo/node_modules/metascraper-clearbit/index.js
- /Users/omardiab/code/open-source/metascraper-demo/index.js
    at Function.Module._resolveFilename (internal/modules/cjs/loader.js:623:15)
    at Function.Module._load (internal/modules/cjs/loader.js:527:27)
    at Module.require (internal/modules/cjs/loader.js:681:19)
    at require (internal/modules/cjs/helpers.js:16:16)
    at Object.<anonymous> (/Users/omardiab/code/open-source/metascraper-demo/node_modules/tldts/build/cjs/index.js:3:15)
    at Module._compile (internal/modules/cjs/loader.js:774:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:785:10)
    at Module.load (internal/modules/cjs/loader.js:641:32)
    at Function.Module._load (internal/modules/cjs/loader.js:556:12)
    at Module.require (internal/modules/cjs/loader.js:681:19) {
  code: 'MODULE_NOT_FOUND',
  requireStack: [
    '/Users/omardiab/code/open-source/metascraper-demo/node_modules/tldts/build/cjs/index.js',
    '/Users/omardiab/code/open-source/metascraper-demo/node_modules/metascraper-clearbit/index.js',
    '/Users/omardiab/code/open-source/metascraper-demo/index.js'
  ]
}

generate non-private subset

(First off, your library is EXACTLY what we were looking for. Many folks have a PSL distribution on npm, but you've done awesome work here. Thank you!)

The PSL has all its public ICANN domains and then at the bottom a bunch of PRIVATE domains.

For our uses, we don't care about the private domains, as we just need the PSL for improved URL parsing/display for the publicly accessible web. Also, we're very bundle-size conscious. (So we appreciate the trie!) Publishing an icann-only subset (or whatever you want to name it) would be super helpful.

Here's an initial idea of the size difference: [image not preserved]

cc @alexnj
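
As a rough sketch of how such a subset could be derived: public_suffix_list.dat delimits its sections with marker comments, so the ICANN rules can be filtered out before any trie is built. The icannRules function and the sample text are hypothetical, and this is not the project's build script.

```javascript
// Keep only the rules between the ICANN section markers of public_suffix_list.dat.
function icannRules(pslText) {
  const rules = [];
  let inIcann = false;
  for (const rawLine of pslText.split('\n')) {
    const line = rawLine.trim();
    if (line.startsWith('// ===BEGIN ICANN DOMAINS===')) inIcann = true;
    else if (line.startsWith('// ===END ICANN DOMAINS===')) inIcann = false;
    else if (inIcann && line !== '' && !line.startsWith('//')) {
      rules.push(line.split(/\s+/)[0]); // a rule ends at the first whitespace
    }
  }
  return rules;
}

// Tiny stand-in for the real file, for illustration only.
const sample = [
  '// ===BEGIN ICANN DOMAINS===',
  '// a comment line',
  'com',
  'co.uk',
  '// ===END ICANN DOMAINS===',
  '// ===BEGIN PRIVATE DOMAINS===',
  'blogspot.com',
  '// ===END PRIVATE DOMAINS===',
].join('\n');

console.log(icannRules(sample)); // [ 'com', 'co.uk' ]
```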

Unable to parse wildcard subdomain

> const { parse } = require('tldts');
undefined
> parse('*.google.com')
{
  domain: null,
  domainWithoutSuffix: null,
  hostname: null,
  isIcann: null,
  isIp: false,
  isPrivate: null,
  publicSuffix: null,
  subdomain: null
}

I've seen #134 however I was expecting this example to work.

Can someone provide an example of a valid use of wildcards with this library? I haven't investigated yet but I'm assuming I'm just missing something simple so any clarification would be appreciated -- thanks!

parse of text 'arguments.constructor' crashes module

Issue:
Module crashes when 'arguments.constructor' is passed as a value to the parse() method
Example strings:
arguments.constructor
http://arguments.constructor
http://arguments.constructor/hello.html
http://www.arguments.constructor/hello.html

Version:
Node.js: 10 and 11 (current)
tldts: 4.0.6
OS: Windows 10

Test Application Source:
const tldts = require('tldts');

let result = tldts.parse(process.argv[2]);

console.log(`domain:${result.domain}`);
console.log(`hostname:${result.hostname}`);
console.log(`isIcann:${result.isIcann}`);
console.log(`isIp:${result.isIp}`);
console.log(`isPrivate:${result.isPrivate}`);
console.log(`publicSuffix:${result.publicSuffix}`);
console.log(`subdomain:${result.subdomain}`);

Test with Successful Run
C:\tldtest>node tldtstest.js https://github.com/remusao/tldts
domain:github.com
hostname:github.com
isIcann:true
isIp:false
isPrivate:false
publicSuffix:com
subdomain:

Test to reproduce error:
C:\tldtest>node tldtstest.js https://arguments.constructor/remusao/tldts
C:\tldtest\node_modules\tldts\dist\tldts.cjs.js:277
node = node[parts[index]] || node['*'];
^

TypeError: 'caller', 'callee', and 'arguments' properties may not be accessed on strict mode functions or the arguments objects for calls to them
    at lookupInTrie (C:\tldtest\node_modules\tldts\dist\tldts.cjs.js:277:20)
    at suffixLookup (C:\tldtest\node_modules\tldts\dist\tldts.cjs.js:293:24)
    at getPublicSuffix (C:\tldtest\node_modules\tldts\dist\tldts.cjs.js:196:21)
    at parseImpl (C:\tldtest\node_modules\tldts\dist\tldts.cjs.js:243:30)
    at Object.parse (C:\tldtest\node_modules\tldts\dist\tldts.cjs.js:315:12)
    at Object.<anonymous> (C:\tldtest\tldtstest.js:3:20)
    at Module._compile (internal/modules/cjs/loader.js:816:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:827:10)
    at Module.load (internal/modules/cjs/loader.js:685:32)
    at Function.Module._load (internal/modules/cjs/loader.js:620:12)
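
The failing line is a plain property lookup, node[parts[index]]. One plausible reading, which is an assumption and not the maintainers' diagnosis, is that labels such as constructor resolve to properties inherited from Object.prototype instead of trie entries, and the next lookup then lands on a function object. The same mechanism would fit the constructor.dk report elsewhere on this page. Below is a minimal reproduction with two common guards; the trie shapes and the childOf helper are hypothetical.

```javascript
// A plain-object trie inherits from Object.prototype, so some labels "exist" by accident.
const plainTrie = { com: {}, dk: {} };
console.log(typeof plainTrie['constructor']); // function (inherited, not a suffix rule)

// Guard 1: only follow own properties.
function childOf(node, label) {
  return Object.prototype.hasOwnProperty.call(node, label) ? node[label] : undefined;
}

// Guard 2: build nodes without a prototype, so nothing is inherited at all.
const bareTrie = Object.assign(Object.create(null), {
  com: Object.create(null),
  dk: Object.create(null),
});

console.log(childOf(plainTrie, 'constructor')); // undefined
console.log(bareTrie['constructor']); // undefined
console.log(childOf(plainTrie, 'dk') !== undefined); // true
```

A Map keyed by label would avoid the problem as well, at the cost of a different in-memory layout.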

Bin: batch input

For the command-line program, please consider accepting input from STDIN.

It would then process each input line and write the resulting JSON to STDOUT.

The jq command can already handle a stream of JSON documents concatenated together like this.
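
A sketch of the requested behaviour, under stated assumptions: formatBatch is a hypothetical name, and hostnameOnly is a stand-in parser built on the global URL class so the example runs without any dependency. It is not the CLI's actual code.

```javascript
// Turn newline-separated input into one JSON document per line of output.
// `parseFn` stands in for whatever parser the CLI wraps.
function formatBatch(text, parseFn) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line !== '')
    .map((line) => JSON.stringify(parseFn(line)))
    .join('\n');
}

// Hypothetical stand-in parser built on the global URL class.
const hostnameOnly = (input) => ({ hostname: new URL(input).hostname });

console.log(formatBatch('http://www.writethedocs.org/\nhttps://example.com\n', hostnameOnly));
// {"hostname":"www.writethedocs.org"}
// {"hostname":"example.com"}
```

In a real command-line program the text would come from standard input, for example via fs.readFileSync(0, 'utf8') in Node.js.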

Doesn't parse domains with leading special chars

Related to #1523

As an example, this will not parse IPFS DNSLink record names. This led to some obscure errors and a few hours of debugging to establish that my own code was not the problem, and unfortunately it will lead to me opting for parse-domains over tldts (which is a little cumbersome, as I was already using tldts quite generously).

DNS names allow leading underscores and dashes. They didn't originally, but the spec was updated to allow for this (hence, IPFS employs a DNSLink TXT record with a subdomain prefix _dnslink).
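
The distinction can be sketched as two label patterns. The assumptions here are that "strict" means the letters-digits-hyphen host name rule and that "lenient" additionally accepts underscores. Neither pattern is taken from tldts, and isValidHostname is a hypothetical helper.

```javascript
// Strict "LDH" host label: letters, digits, hyphens; no leading or trailing hyphen; 1-63 chars.
const STRICT_LABEL = /^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/i;
// Lenient variant that additionally accepts underscores anywhere in the label.
const LENIENT_LABEL = /^[a-z0-9_]([a-z0-9_-]{0,61}[a-z0-9_])?$/i;

function isValidHostname(hostname, { lenient = false } = {}) {
  if (hostname.length === 0 || hostname.length > 253) return false;
  const pattern = lenient ? LENIENT_LABEL : STRICT_LABEL;
  return hostname.split('.').every((label) => pattern.test(label));
}

console.log(isValidHostname('_dnslink.example.com')); // false
console.log(isValidHostname('_dnslink.example.com', { lenient: true })); // true
console.log(isValidHostname('www.example.com')); // true
```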

`parse` with only selected fields

In my use case, I only want to get the domain and the subdomain of a domain.

I can simply do tldts.getDomain() and then tldts.getSubdomain(), but the input domain would be parsed twice.

Under the hood, tldts-core does have FLAG constants, but they are not exposed by tldts. On the other hand, the suffixLookup function is not exported, so I can't build my own parseDomainAndSubdomain on top of tldts-core directly.
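
In the meantime, a single parse() call already returns both domain and subdomain, as the README output shows. When only a hostname and its domain are at hand, the subdomain can also be derived by string slicing instead of a second lookup. subdomainOf below is a hypothetical helper, not a tldts export.

```javascript
// Derive the subdomain from a hostname and its already-known registrable domain.
function subdomainOf(hostname, domain) {
  if (hostname === domain) return '';
  if (!hostname.endsWith(`.${domain}`)) return null; // domain is not a suffix of hostname
  // Drop the domain and the dot that precedes it.
  return hostname.slice(0, hostname.length - domain.length - 1);
}

console.log(subdomainOf('www.writethedocs.org', 'writethedocs.org')); // www
console.log(subdomainOf('example.com', 'example.com')); // (empty string)
```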
