Domains Project: World’s single largest internet domains dataset

Welcome to Domains Project!

This public dataset contains a freely available, sorted list of internet domains.

Dataset statistics

Milestones:

Domains

(Wasted) Internet traffic:

Random facts:

Using dataset

Raw (unpacked and unfiltered) data may be available at https://dataset.domainsproject.org, though using the GitHub repo is recommended.

This repository employs Git LFS, so you need both git lfs and xz to retrieve the data. The cloning procedure is as follows:

git clone https://github.com/tb0hdan/domains.git
cd domains
./unpack.sh
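
Before cloning, it can help to confirm that the required tools are installed. The following is a minimal Python sketch and not part of the repository: the helper name `missing_tools` is hypothetical, and it assumes Git LFS is installed as a `git-lfs` executable on PATH, which is how it is usually packaged.

```python
import shutil

# Tools the cloning procedure relies on. "git-lfs" is the executable
# behind the `git lfs` subcommand (assumed to be on PATH).
REQUIRED_TOOLS = ("git", "git-lfs", "xz")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```

The `which` parameter exists only so the lookup can be swapped out in tests; in normal use the default `shutil.which` is enough.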

Data format

After unpacking, the domain lists are plain text files (~4.6 GB at 230 million domains) with one domain per line. Sample from data/afghanistan/domain2multi-af.txt:

1tv.af
1tvnews.af
3rdeye.af
8am.af
aan.af
acaa.gov.af
acb.af
acbr.gov.af
acci.org.af
ach.af
acku.edu.af
acsf.af
adras.af
aeiti.af
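
Because the format is one domain per line, the lists can be processed as a stream rather than loaded into memory at once. The sketch below is an illustration in Python, not code from the project: the function names `iter_domains` and `count_by_suffix` are hypothetical, and reading `.xz` files directly assumes the compressed lists are ordinary xz-compressed text.

```python
import lzma
from collections import Counter
from pathlib import Path


def iter_domains(path):
    """Yield one domain per non-empty line, stripped and lowercased.

    Files ending in ".xz" are decompressed on the fly; anything else
    is read as plain text.
    """
    path = Path(path)
    opener = lzma.open if path.suffix == ".xz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            domain = line.strip().lower()
            if domain:
                yield domain


def count_by_suffix(domains):
    """Count domains by the label after the last dot (e.g. "af")."""
    return Counter(domain.rsplit(".", 1)[-1] for domain in domains)
```

For example, `count_by_suffix(iter_domains("data/afghanistan/domain2multi-af.txt"))` would tally the unpacked Afghanistan list by its final label. A generator is used so that a multi-gigabyte file is read line by line.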

Search engines and crawlers

Crawlers

Domains Project bot

A typical user agent for the Domains Project bot looks like this:

Mozilla/5.0 (compatible; Domains Project/1.0.8; +https://domainsproject.org)

Some older versions point to the GitHub repo instead:

Mozilla/5.0 (compatible; Domains Project/1.0.4; +https://github.com/tb0hdan/domains)
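
Site operators who want to spot the bot in their access logs could match on the product token that both sample strings share. The following Python sketch is an assumption-based illustration: the pattern is derived only from the two user agents shown above, other releases are assumed to follow the same `Domains Project/<version>` shape, and the name `bot_version` is hypothetical.

```python
import re

# Pattern derived from the two sample user agents; other versions are
# assumed to follow the same "Domains Project/<version>" shape.
BOT_PATTERN = re.compile(r"\bDomains Project/(\d+(?:\.\d+)*)")


def bot_version(user_agent):
    """Return the bot version string, or None if the token is absent."""
    match = BOT_PATTERN.search(user_agent)
    return match.group(1) if match else None
```

Keep in mind that user agent strings can be set by any client, so a match indicates a claimed identity rather than a verified one.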

All data in this dataset is gathered using the Scrapy and Colly frameworks.

Crawler code for this project is available at: Domains Crawler

Starting with version 1.0.7, Domains Crawler supports robots.txt and rate limiting. Please open an issue if you experience any problems, and don’t forget to include your domain.

Others

Yacy

Yacy is a great open-source search engine. Here’s my post on the Yacy forum: https://searchlab.eu/t/domain-list-for-easier-search-bootstrapping/231

Additional sources

List of .FR domains from AfNIC.fr

Majestic Million

Internetstiftelsen Zone Data

DNS Census 2013

bigdatanews extract from Common Crawl (circa 2012)

Common Crawl - March/April 2020

Research

This dataset can be used for research. There are papers that cover different topics. I’m just going to leave links to them here for reference.

Re-registration and general statistics

Analysis of the Internet Domain Names Re-registration Market

Lexical analysis of malicious domains

Detection of malicious domains through lexical analysis

Malicious Domain Names Detection Algorithm Based on Lexical Analysis and Feature Quantification

Detecting Malicious URLs Using Lexical Analysis