tb0hdan / domains

Home Page: https://domainsproject.org

License: BSD 3-Clause "New" or "Revised" License

Languages: Shell 35.75%, HTML 64.25%
Topics: yacy, scrapy, dataset, search-engines, internet-domains, colly

domains's Introduction

Domains Project: Processing petabytes of data so you don't have to


World's single largest Internet domains dataset

This public dataset contains a freely available, sorted list of Internet domains.

Dataset statistics

Project news

Support needed!

You can support this project by doing any combination of the following:

  • Posting a link on your website to DomainsProject
  • Opening an issue and attaching other domain datasets that are not here yet (be sure to scroll through this README first)
  • Publishing research work and linking to DomainsProject
  • Sponsoring this project. See Subscriptions

Milestones:

Domains

  • 10 Million
  • 100 Million
  • 1 Billion
  • 1.7 Billion

(Wasted) Internet traffic:

  • 500TB
  • 925TB
  • 1PB
  • 1.3PB
  • 1.5PB

Random facts:

  • More than 1 TB of Internet traffic compresses down to just 3 MB of data
  • 1 million domains is just 5 MB compressed
  • More than 5.7 PB of Internet traffic is necessary to crawl 1.7 billion domains (3.4 TB per 1 million)
  • Only 4.6 GB of disk space is required to store 1.7 billion domains in compressed form
  • A fully saturated 1 Gbit/s link is good for about 2 million new domains every day
  • An 8c/16t machine with 64 GB of RAM is good for about 2 million new domains every day
  • Two ISC BIND 9 instances (>400 MB RSS each) are required to get 2 million new domains every day
  • After reaching 9 million domains, the repository was switched to compressed files. Please use the freely available XZ to unpack the files.
  • After reaching 30 million records, files were moved to /data so the repository doesn't have its README at the very bottom.
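
As a sanity check, the 5.7 PB figure follows directly from the per-million cost; a quick shell calculation (bc assumed available):

echo '1700 * 3.4' | bc   # 1,700 million domains x 3.4 TB per million = 5780 TB, roughly 5.7 PB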

Used by

CloudSEK

Using dataset

This repository employs Git LFS, so users need both git lfs and xz to retrieve the data. The cloning procedure is as follows:

git clone https://github.com/tb0hdan/domains.git
cd domains
git lfs install
./unpack.sh
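
If you prefer to unpack manually instead of running unpack.sh, a minimal sketch (assuming the archives sit under data/ as .txt.xz files, as described in the Data format section below):

find data -name '*.txt.xz' -exec xz -dk {} \;   # -d decompress, -k keep the original .xz files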

Getting unfiltered dataset

Subscribers have access to the raw data at https://dataset.domainsproject.org

Some other available features:

  • TLD only
  • Websocket for new domains
  • DNS JSON (with historical data)
To mirror the dataset:

wget -m https://dataset.domainsproject.org

Data format

After unpacking, domain lists are just text files (~49 GB at 1.7 billion domains) with one domain per line. Sample from data/afghanistan/domain2multi-af.txt:

1tv.af
1tvnews.af
3rdeye.af
8am.af
aan.af
acaa.gov.af
acb.af
acbr.gov.af
acci.org.af
ach.af
acku.edu.af
acsf.af
adras.af
aeiti.af
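
Because each file is plain text with one domain per line, standard Unix tools work on the unpacked data directly; a small illustration against the sample file above:

# count the domains in a single country file
wc -l data/afghanistan/domain2multi-af.txt
# extract only the government domains (pattern is illustrative)
grep '\.gov\.af$' data/afghanistan/domain2multi-af.txt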

Search engines and crawlers

Crawlers

Domains Project bot

Domains Project uses a crawler and DNS checks to get new domains.

The DNS checks client is in its early stages and is used by a select few. It is called Freya, and I'm working on making it stable and good enough for the general public.

The HTTP crawler is being rewritten as well. It is called Idun.

A typical user agent for the Domains Project bot looks like this:

Mozilla/5.0 (compatible; Domains Project/1.0.8; +https://domainsproject.org)

Some older versions have the user agent set to the GitHub repo:

Mozilla/5.0 (compatible; Domains Project/1.0.4; +https://github.com/tb0hdan/domains)

All data in this dataset is gathered using the Scrapy and Colly frameworks.

Starting with version 1.0.7, the crawler has partial robots.txt support and rate limiting. Please open an issue if you experience any problems. Don't forget to include your domain.

Disabling Domains Project bot access to your website

Add this to your robots.txt:

User-agent: domainsproject.org
Disallow: /

or this:

User-agent: Domains Project
Disallow: /

The bot checks for both.
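
To check that your rule is actually being served, fetch your own robots.txt and look for the entry (example.com is a placeholder):

curl -s https://example.com/robots.txt | grep -i -A 1 'domains'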

Others

Yacy

Yacy is a great open-source search engine. Here's my post on the Yacy forum: https://searchlab.eu/t/domain-list-for-easier-search-bootstrapping/231

Additional sources

Rapid7 Sonar FDNS - no longer open

List of .FR domains from AfNIC.fr

Majestic Million

Internetstiftelsen Zone Data

DNS Census 2013

bigdatanews extract from Common Crawl (circa 2012)

Common Crawl - March/April 2020

The CAIDA UCSD IPv4 Routed /24 DNS Names Dataset - January/July 2019

GSA Data

OpenPageRank 10m hosts

Switch.ch Open Data

Slovak domains - Open Data

Research

This dataset can be used for research. There are papers that cover different topics; I'm just going to leave links to them here for reference.

Published works based on this dataset

Phishing Protection SPF, DKIM, DMARC

Email address analysis (Czech)

Proteus: A Self-Designing Range Filter

Large Scale String Analytics in Arkouda

Fake Phishing: Setup, detection, and take-down

Cloudy with a Chance of Cyberattacks: Dangling Resources Abuse on Cloud Platforms

Analysis

The Internet of Names: A DNS Big Dataset

Enabling Network Security Through Active DNS Datasets

Re-registration and general statistics

Analysis of the Internet Domain Names Re-registration Market

Lexical analysis of malicious domains

Detection of malicious domains through lexical analysis

Malicious Domain Names Detection Algorithm Based on Lexical Analysis and Feature Quantification

Detecting Malicious URLs Using Lexical Analysis

domains's People

Contributors

cesnokov, tb0hdan


domains's Issues

Wget returns 403 forbidden

(same issue as #5)

Hey, I'm including one of the datasets in my script. I'm using Python's wget package (which uses the standard wget user agent) and Cloudflare gives me a 403 error.

urllib.error.HTTPError: HTTP Error 403: Forbidden
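
A possible workaround, assuming Cloudflare is rejecting the default wget User-Agent: send a browser-like one instead (whether it passes depends on the site's current rules):

wget -m --user-agent='Mozilla/5.0 (X11; Linux x86_64)' https://dataset.domainsproject.org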

Issue with your patreon

Hello, I'm trying to subscribe to your Patreon and every time I do, the account gets disabled. I wanted to know if you have any contact method like Telegram or XMPP so we can discuss the payment.

Missing domains

Hi,

Thanks for this nice project. I found that some domains are missing from your dataset, such as "aaaa.com" or "azjj.com". Is there a reason for this?

Thanks.

Crawler causing SYN floods

Yesterday at 18:17 CEST we noted a SYN flood caused by the project crawler. Please implement request limits.

Add example of what data looks like

I'm interested in this data, but the download and space commitments look large, so it would be great if there were some examples of the records I could expect to see in the README.

Do not download large files such as FLAC and MP3

Downloading these places a significant load on servers, and most are not going to contain URL metadata of use to the project.

This is probably true of image files too.

At least please explicitly describe a suitable robots.txt User-agent name to stop the tool scraping inappropriate sites/subtrees.

Rgds

Damon

Raw data page returns 403.

I'm trying to download the data from dataset.domainsproject.org but I'm getting an HTTP 403.

# wget -m https://dataset.domainsproject.org
--2021-05-20 16:14:26--  https://dataset.domainsproject.org/
Resolving dataset.domainsproject.org (dataset.domainsproject.org)... 104.26.14.47, 104.26.15.47, 172.67.72.229, ...
Connecting to dataset.domainsproject.org (dataset.domainsproject.org)|104.26.14.47|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2021-05-20 16:14:26 ERROR 403: Forbidden.

Extra domain

While looking where dunctebot.link was getting requests from on the internet I stumbled upon this cool dataset. Since I'm getting rid of that domain soon here's the domain that will replace it duncte.bot

Error while cloning the project: repository is over its data quota

Cloning into 'domains'...
remote: Enumerating objects: 170161, done.
remote: Counting objects: 100% (17313/17313), done.
remote: Compressing objects: 100% (17245/17245), done.
remote: Total 170161 (delta 75), reused 17304 (delta 67), pack-reused 152848
Receiving objects: 100% (170161/170161), 1.71 GiB | 16.16 MiB/s, done.
Resolving deltas: 100% (1105/1105), done.
Downloading data/afghanistan/domain2multi-af00.txt.xz (79 KB)
Error downloading object: data/afghanistan/domain2multi-af00.txt.xz (4b3903e): Smudge error: Error downloading data/afghanistan/domain2multi-af00.txt.xz (4b3903eef1e6e05f9f526e7eaa667ce0528d149dc12169299de8e3868601e4de): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Errors logged to '/home/system/new/domains/.git/lfs/logs/20240202T235755.763457986.log'.
Use git lfs logs last to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: data/afghanistan/domain2multi-af00.txt.xz: smudge filter lfs failed
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'
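
A common workaround for LFS quota errors (standard git-lfs behaviour, not specific to this repo) is to skip the smudge step so the clone itself completes, then fetch the LFS objects once bandwidth is available again:

GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/tb0hdan/domains.git
cd domains
git lfs pull   # still fails until the quota resets, but the checkout no longer aborts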

can't be cloned due to error

git clone is not working on this repo and fails with the following error:

Cloning into 'domains'...
remote: Enumerating objects: 170140, done.
remote: Counting objects: 100% (17292/17292), done.
remote: Compressing objects: 100% (17265/17265), done.
remote: Total 170140 (delta 61), reused 17250 (delta 26), pack-reused 152848
Receiving objects: 100% (170140/170140), 1.71 GiB | 6.87 MiB/s, done.
Resolving deltas: 100% (1091/1091), done.
Downloading data/afghanistan/domain2multi-af00.txt.xz (79 KB)
Error downloading object: data/afghanistan/domain2multi-af00.txt.xz (4b3903e): Smudge error: Error downloading data/afghanistan/domain2multi-af00.txt.xz (4b3903eef1e6e05f9f526e7eaa667ce0528d149dc12169299de8e3868601e4de): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Anyone else having the same issue?

ICANN CZDS to seed domains

Would the ICANN CZDS API (https://czds.icann.org/) be a useful mechanism for you to more efficiently pull a list of all domains across the web? It allows you to apply for access to the zone data for each TLD, and then using their API tool, you can quickly download the zone file for each TLD that approves your request for access. This approach wouldn't be effective for finding subdomains though, so further work would be needed there, but perhaps this would be a good way to seed your scan?

I was also wondering if you tracked A records, or similar for each domain. Looks like the dataset is just domains, so I suspect the answer is no, but thought I would ask just in case I'm missing it somewhere as you mentioned using DNS checks in the README.

If you are only searching domains, why is your crawler scraping pages???

  1. It doesn't appear that your crawler checks for a robots.txt before crawling a site. This is BAD practice if you want to be a 'legitimate' bot. Some sites do NOT want to be crawled by random bots, or there are pages you shouldn't be crawling (for various reasons)...

  2. You appear to have ZERO rate limiting for your crawl speed. That is ABUSE... I count 40 PAGES/sec request rate (and that doesn't include other content)...

Error when downloading

Hi, I'm getting an error message when cloning the repo:

error: 472 bytes of body are still expected3.29 MiB | 2.00 KiB/s
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output

And if I try to download it as a Zip file, when I try to open it, I get a corrupt file.

Thanks!

Domain dataset source: ICANN Centralized Zone Data Service (CZDS)

ICANN (Internet Corporation for Assigned Names and Numbers) provides a Centralized Zone Data Service (CZDS) where you can request free access to the Zone Files provided by participating generic Top-Level Domains (gTLDs).

Currently there are over 1100 generic TLDs (including the big ones like com, net, org) available for download (terms-of-service limit: maximum of one download per 24 hours per zonefile).

You can scrape your domains from the zonefiles.


You will need to provide a request "reason" to each zonefile registrar.

From my experience, most (>98%) registrars expect the following information (a handful will reject your access request if any of the following information is omitted):

  • Name
  • Email
  • IP Address
  • Physical Address (Building, Street, Postcode etc.)
  • Phone Number
  • Reason (what you intend to do with the zonefiles)

It is also possible to automatically request zonefile access rights and download zonefiles via the ICANN CZDS API instead of from the website user interface.
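
A minimal sketch of that flow (the endpoints follow ICANN's published CZDS API docs, but treat them and the response fields as assumptions to verify; credentials are placeholders, jq assumed installed):

# authenticate and capture a bearer token
TOKEN=$(curl -s -X POST https://account-api.icann.org/api/authenticate \
    -H 'Content-Type: application/json' \
    -d '{"username":"you@example.com","password":"secret"}' | jq -r .accessToken)
# list download links for the zone files you have been approved for
curl -s -H "Authorization: Bearer $TOKEN" https://czds-api.icann.org/czds/downloads/links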
