tb0hdan / domains

Home Page: https://domainsproject.org

License: BSD 3-Clause "New" or "Revised" License

Languages: Shell 35.75%, HTML 64.25%
Topics: yacy, scrapy, dataset, search-engines, internet-domains, colly

domains's Introduction

Domains Project: Processing petabytes of data so you don't have to


World's single largest Internet domains dataset

This public dataset contains a freely available, sorted list of Internet domains.

Dataset statistics

Project news

Support needed!

You can support this project by doing any combination of the following:

  • Posting a link on your website to DomainsProject
  • Opening an issue and attaching other domain datasets that are not here yet (be sure to scroll through this README first)
  • Publishing research work and linking to DomainsProject
  • Sponsoring this project. See Subscriptions

Milestones:

Domains

  • 10 Million
  • 100 Million
  • 1 Billion
  • 1.7 Billion

(Wasted) Internet traffic:

  • 500TB
  • 925TB
  • 1PB
  • 1.3PB
  • 1.5PB

Random facts:

  • More than 1 TB of Internet traffic compresses down to just 3 MB of data
  • 1 million domains is just 5 MB compressed
  • More than 5.7 PB of Internet traffic is necessary to crawl 1.7 billion domains (3.4 TB per 1 million)
  • Only 4.6 GB of disk space is required to store 1.7 billion domains in compressed form
  • A fully saturated 1 Gbit/s link is good for about 2 million new domains every day
  • An 8c/16t machine with 64 GB of RAM is good for about 2 million new domains every day
  • Two ISC BIND 9 instances (>400 MB RSS each) are required to get 2 million new domains every day
  • After reaching 9 million domains, the repository was switched to compressed files. Please use the freely available XZ to unpack the files.
  • After reaching 30 million records, files were moved to /data so the repository doesn't have its README at the very bottom.
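
As a sanity check, the 5.7 PB figure follows directly from the per-million cost; a quick shell calculation (bc assumed available):

echo '1700 * 3.4' | bc   # 1,700 million domains x 3.4 TB per million = 5780 TB, roughly 5.7 PB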

Used by

CloudSEK

Using dataset

This repository employs Git LFS, so users need both git lfs and xz to retrieve the data. The cloning procedure is as follows:

git clone https://github.com/tb0hdan/domains.git
cd domains
git lfs install
./unpack.sh
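
If you prefer to unpack manually instead of running unpack.sh, a minimal sketch (assuming the archives sit under data/ as .txt.xz files, as described in the Data format section below):

find data -name '*.txt.xz' -exec xz -dk {} \;   # -d decompress, -k keep the original .xz files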

Getting unfiltered dataset

Subscribers have access to the raw data at https://dataset.domainsproject.org

Some other available features:

  • TLD only
  • Websocket for new domains
  • DNS JSON (with historical data)
To mirror the dataset:

wget -m https://dataset.domainsproject.org

Data format

After unpacking, domain lists are just text files (~49 GB at 1.7 billion domains) with one domain per line. Sample from data/afghanistan/domain2multi-af.txt:

1tv.af
1tvnews.af
3rdeye.af
8am.af
aan.af
acaa.gov.af
acb.af
acbr.gov.af
acci.org.af
ach.af
acku.edu.af
acsf.af
adras.af
aeiti.af
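
Because each file is plain text with one domain per line, standard Unix tools work on the unpacked data directly; a small illustration against the sample file above:

# count the domains in a single country file
wc -l data/afghanistan/domain2multi-af.txt
# extract only the government domains (pattern is illustrative)
grep '\.gov\.af$' data/afghanistan/domain2multi-af.txt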

Search engines and crawlers

Crawlers

Domains Project bot

Domains Project uses a crawler and DNS checks to get new domains.

The DNS checks client is in its early stages and is used by a select few. It is called Freya, and I'm working on making it stable and good enough for the general public.

The HTTP crawler is being rewritten as well. It is called Idun.

A typical user agent for the Domains Project bot looks like this:

Mozilla/5.0 (compatible; Domains Project/1.0.8; +https://domainsproject.org)

Some older versions have the user agent set to the GitHub repo:

Mozilla/5.0 (compatible; Domains Project/1.0.4; +https://github.com/tb0hdan/domains)

All data in this dataset is gathered using the Scrapy and Colly frameworks.

Starting with version 1.0.7, the crawler has partial robots.txt support and rate limiting. Please open an issue if you experience any problems. Don't forget to include your domain.

Disabling Domains Project bot access to your website

Add this to your robots.txt:

User-agent: domainsproject.org
Disallow: /

or this:

User-agent: Domains Project
Disallow: /

The bot checks for both.
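
To check that your rule is actually being served, fetch your own robots.txt and look for the entry (example.com is a placeholder):

curl -s https://example.com/robots.txt | grep -i -A 1 'domains'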

Others

Yacy

Yacy is a great open-source search engine. Here's my post on the Yacy forum: https://searchlab.eu/t/domain-list-for-easier-search-bootstrapping/231

Additional sources

Rapid7 Sonar FDNS - no longer open

List of .FR domains from AfNIC.fr

Majestic Million

Internetstiftelsen Zone Data

DNS Census 2013

bigdatanews extract from Common Crawl (circa 2012)

Common Crawl - March/April 2020

The CAIDA UCSD IPv4 Routed /24 DNS Names Dataset - January/July 2019

GSA Data

OpenPageRank 10m hosts

Switch.ch Open Data

Slovak domains - Open Data

Research

This dataset can be used for research. There are papers that cover different topics; I'm just going to leave links to them here for reference.

Published works based on this dataset

Phishing Protection SPF, DKIM, DMARC

Email address analysis (Czech)

Proteus: A Self-Designing Range Filter

Large Scale String Analytics in Arkouda

Fake Phishing: Setup, detection, and take-down

Cloudy with a Chance of Cyberattacks: Dangling Resources Abuse on Cloud Platforms

Analysis

The Internet of Names: A DNS Big Dataset

Enabling Network Security Through Active DNS Datasets

Re-registration and general statistics

Analysis of the Internet Domain Names Re-registration Market

Lexical analysis of malicious domains

Detection of malicious domains through lexical analysis

Malicious Domain Names Detection Algorithm Based on Lexical Analysis and Feature Quantification

Detecting Malicious URLs Using Lexical Analysis

domains's People

Contributors

cesnokov, tb0hdan


domains's Issues

Wget returns 403 forbidden

(same issue as #5)

Hey, I'm including one of the datasets in my script. I'm using Python's wget package (which uses the standard wget user agent) and Cloudflare gives me a 403 error.

urllib.error.HTTPError: HTTP Error 403: Forbidden
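
A possible workaround, assuming Cloudflare is rejecting the default wget User-Agent: send a browser-like one instead (whether it passes depends on the site's current rules):

wget -m --user-agent='Mozilla/5.0 (X11; Linux x86_64)' https://dataset.domainsproject.org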

Issue with your patreon

Hello, I'm trying to subscribe to your Patreon and every time I do, the account gets disabled. I wanted to know if you have any contact method like Telegram or XMPP so we can discuss the payment.

Missing domains

Hi,

Thanks for this nice project. I found that some domains are missing from your dataset, such as "aaaa.com" or "azjj.com". Is there a reason for this?

Thanks.

Crawler causing SYN floods

Yesterday at 18:17 CEST we noted a SYN flood caused by the project crawler. Please implement request limits.

Add example of what data looks like

I'm interested in this data, but the download and space commitments look large, so it would be great if there were some examples of the records I could expect to see in the README.

Do not download large files such as FLAC and MP3

Downloading these places a significant load on servers, and most are not going to contain URL metadata of use to the project.

This is probably true of image files too.

At least please explicitly describe a suitable robots.txt User-agent name to stop the tool scraping inappropriate sites/subtrees.

Rgds

Damon

Raw data page returns 403.

I'm trying to download the data from dataset.domainsproject.org but I'm getting an HTTP 403.

# wget -m https://dataset.domainsproject.org
--2021-05-20 16:14:26--  https://dataset.domainsproject.org/
Resolving dataset.domainsproject.org (dataset.domainsproject.org)... 104.26.14.47, 104.26.15.47, 172.67.72.229, ...
Connecting to dataset.domainsproject.org (dataset.domainsproject.org)|104.26.14.47|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2021-05-20 16:14:26 ERROR 403: Forbidden.

Extra domain

While looking where dunctebot.link was getting requests from on the internet I stumbled upon this cool dataset. Since I'm getting rid of that domain soon here's the domain that will replace it duncte.bot

Error while cloning the project: repository is over its data quota

Cloning into 'domains'...
remote: Enumerating objects: 170161, done.
remote: Counting objects: 100% (17313/17313), done.
remote: Compressing objects: 100% (17245/17245), done.
remote: Total 170161 (delta 75), reused 17304 (delta 67), pack-reused 152848
Receiving objects: 100% (170161/170161), 1.71 GiB | 16.16 MiB/s, done.
Resolving deltas: 100% (1105/1105), done.
Downloading data/afghanistan/domain2multi-af00.txt.xz (79 KB)
Error downloading object: data/afghanistan/domain2multi-af00.txt.xz (4b3903e): Smudge error: Error downloading data/afghanistan/domain2multi-af00.txt.xz (4b3903eef1e6e05f9f526e7eaa667ce0528d149dc12169299de8e3868601e4de): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Errors logged to '/home/system/new/domains/.git/lfs/logs/20240202T235755.763457986.log'.
Use git lfs logs last to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: data/afghanistan/domain2multi-af00.txt.xz: smudge filter lfs failed
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'
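
A common workaround for LFS quota errors (standard git-lfs behaviour, not specific to this repo) is to skip the smudge step so the clone itself completes, then fetch the LFS objects once bandwidth is available again:

GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/tb0hdan/domains.git
cd domains
git lfs pull   # still fails until the quota resets, but the checkout no longer aborts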

can't be cloned due to error

git clone is not working on this repo and fails with the following error:

Cloning into 'domains'...
remote: Enumerating objects: 170140, done.
remote: Counting objects: 100% (17292/17292), done.
remote: Compressing objects: 100% (17265/17265), done.
remote: Total 170140 (delta 61), reused 17250 (delta 26), pack-reused 152848
Receiving objects: 100% (170140/170140), 1.71 GiB | 6.87 MiB/s, done.
Resolving deltas: 100% (1091/1091), done.
Downloading data/afghanistan/domain2multi-af00.txt.xz (79 KB)
Error downloading object: data/afghanistan/domain2multi-af00.txt.xz (4b3903e): Smudge error: Error downloading data/afghanistan/domain2multi-af00.txt.xz (4b3903eef1e6e05f9f526e7eaa667ce0528d149dc12169299de8e3868601e4de): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Anyone else having the same issue?

ICANN CZDS to seed domains

Would the ICANN CZDS API (https://czds.icann.org/) be a useful mechanism for you to more efficiently pull a list of all domains across the web? It allows you to apply for access to the zone data for each TLD, and then using their API tool, you can quickly download the zone file for each TLD that approves your request for access. This approach wouldn't be effective for finding subdomains though, so further work would be needed there, but perhaps this would be a good way to seed your scan?

I was also wondering if you tracked A records, or similar for each domain. Looks like the dataset is just domains, so I suspect the answer is no, but thought I would ask just in case I'm missing it somewhere as you mentioned using DNS checks in the README.

If you are only searching domains, why is your crawler scraping pages???

  1. It doesn't appear that your crawler checks for a robots.txt before crawling a site. This is BAD practice if you want to be a 'legitimate' bot. Some sites do NOT want to be crawled by random bots, or there are pages you shouldn't be crawling (for various reasons)...

  2. You appear to have ZERO rate limiting for your crawl speed. That is ABUSE... I count 40 PAGES/sec request rate (and that doesn't include other content)...

Error when downloading

Hi, I'm getting an error message when cloning the repo:

error: 472 bytes of body are still expected3.29 MiB | 2.00 KiB/s
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output

And if I try to download it as a Zip file, when I try to open it, I get a corrupt file.

Thanks!

Domain dataset source: ICANN Centralized Zone Data Service (CZDS)

ICANN (Internet Corporation for Assigned Names and Numbers) provides a Centralized Zone Data Service (CZDS) where you can request free access to the Zone Files provided by participating generic Top-Level Domains (gTLDs).

Currently there are over 1100 generic TLDs (including the big ones like com, net, org) available for download (terms-of-service limit: maximum of one download per 24 hours per zonefile).

You can scrape your domains from the zonefiles.


You will need to provide a request "reason" to each zonefile registrar.

From my experience, most (>98%) registrars expect the following information (a handful will reject your access request if any of the following information is omitted):

  • Name
  • Email
  • IP Address
  • Physical Address (Building, Street, Postcode etc.)
  • Phone Number
  • Reason (what you intend to do with the zonefiles)

It is also possible to automatically request zonefile access rights and download zonefiles via the ICANN CZDS API instead of from the website user interface.
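
A minimal sketch of that flow (the endpoints follow ICANN's published CZDS API docs, but treat them and the response fields as assumptions to verify; credentials are placeholders, jq assumed installed):

# authenticate and capture a bearer token
TOKEN=$(curl -s -X POST https://account-api.icann.org/api/authenticate \
    -H 'Content-Type: application/json' \
    -d '{"username":"you@example.com","password":"secret"}' | jq -r .accessToken)
# list download links for the zone files you have been approved for
curl -s -H "Authorization: Bearer $TOKEN" https://czds-api.icann.org/czds/downloads/links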
