Domains Project: Processing petabytes of data so you don’t have to

This public dataset contains a freely available, sorted list of Internet domains.

Dataset statistics

Project news

(Wasted) Internet traffic:

Random facts:

Using dataset

This repository employs Git LFS, so you need both git lfs and xz to retrieve the data. The cloning procedure is as follows:

git clone
cd domains
git lfs install
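
Once the LFS objects are fetched, the lists are still xz-compressed. As a minimal sketch (not part of the project's own tooling), Python's standard `lzma` module can stream such a file line by line without unpacking it to disk first. The function name `iter_domains` is hypothetical and chosen only for illustration.

```python
import lzma
from pathlib import Path
from typing import Iterator


def iter_domains(path: str) -> Iterator[str]:
    """Yield domains from a plain-text or .xz-compressed list, one per line."""
    p = Path(path)
    # lzma.open in text mode decompresses on the fly, so the whole file
    # never has to be unpacked to disk or held in memory.
    opener = lzma.open if p.suffix == ".xz" else open
    with opener(p, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            domain = line.strip()
            if domain:  # skip blank lines
                yield domain
```

Calling `iter_domains` on a path ending in `.xz` reads the compressed file directly; any other path is treated as an already unpacked text file.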

Getting unfiltered dataset

Subscribers have access to raw data, which is available at

Some other available features:

wget -m

Data format

After unpacking, domain lists are plain text files (~49 GB for 1.7 billion domains) with one domain per line. Sample for data/afghanistan/domain2multi-af.txt:
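
Because the format is one domain per line, a list can be processed as a stream of lines. The sketch below counts domains per top-level domain, treating the last dot-separated label as the TLD; that is a simplifying assumption and does not account for multi-label public suffixes such as `co.uk`. The name `tld_counts` is hypothetical.

```python
from collections import Counter
from typing import Iterable


def tld_counts(lines: Iterable[str]) -> Counter:
    """Count domains per top-level domain in a one-domain-per-line stream."""
    counts: Counter = Counter()
    for line in lines:
        # Normalise: drop whitespace, lowercase, remove a trailing root dot.
        domain = line.strip().lower().rstrip(".")
        if not domain:
            continue
        # The last label is taken as the TLD (simplifying assumption).
        counts[domain.rsplit(".", 1)[-1]] += 1
    return counts
```

Any iterable of lines works as input, including an open file object, so a multi-gigabyte list never has to be loaded into memory at once.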

Search engines and crawlers


Domains Project bot

Domains Project uses a crawler and DNS checks to discover new domains.

The DNS checks client is in its early stages and is used by a select few. It is called Freya, and I’m working on making it stable and good enough for the general public.

The HTTP crawler is being rewritten as well. It is called Idun.

Typical user agent for Domains Project bot looks like this:

Mozilla/5.0 (compatible; Domains Project/1.0.8; +

Some older versions have the URL set to the GitHub repo:

Mozilla/5.0 (compatible; Domains Project/1.0.4; +

All data in this dataset is gathered using the Scrapy and Colly frameworks.

Starting with version 1.0.7, the crawler has partial robots.txt support and rate limiting. Please open an issue if you experience any problems. Don’t forget to include your domain.
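
If you want to tell from your access logs whether a given request came from a bot version that already has robots.txt support, the version can be read from the user agent string. This is a rough sketch that assumes the `Domains Project/<version>` token shown above is stable; the helper names and the regular expression are hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches the "Domains Project/1.0.8" token inside the user agent string.
_UA_RE = re.compile(r"Domains Project/(\d+(?:\.\d+)*)")

# First version with partial robots.txt support, per the text above.
ROBOTS_SUPPORT_SINCE = (1, 0, 7)


def bot_version(user_agent: str) -> Optional[Tuple[int, ...]]:
    """Return the bot version as a tuple of ints, or None if not the bot."""
    match = _UA_RE.search(user_agent)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def has_robots_support(user_agent: str) -> bool:
    """True if the user agent is the bot at version 1.0.7 or newer."""
    version = bot_version(user_agent)
    return version is not None and version >= ROBOTS_SUPPORT_SINCE
```

Comparing tuples of integers avoids the pitfalls of string comparison, where `"1.0.10"` would sort before `"1.0.7"`.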

Disabling Domains Project bot access to your website

Add this to your robots.txt:


or this:

User-agent: Domains Project

The bot checks for both.
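
To sanity-check a robots.txt rule before deploying it, Python's standard `urllib.robotparser` can evaluate it locally. This sketch assumes the `User-agent: Domains Project` line is paired with a `Disallow: /` rule, which is not shown above; it reflects how the standard-library parser reads the file, not necessarily how the bot's own partial robots.txt support behaves. The helper `is_blocked` and the `example.com` URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule set: the Disallow line is an assumption, not from the README.
RULES = "User-agent: Domains Project\nDisallow: /\n"


def is_blocked(robots_txt: str,
               agent: str = "Domains Project",
               url: str = "https://example.com/") -> bool:
    """Return True if robots_txt forbids `agent` from fetching `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)
```

Under these assumptions, `is_blocked(RULES)` reports the site as closed to the bot, while a file that only names other user agents leaves it open.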



YaCy is a great open-source search engine. Here’s my post on the YaCy forum:

Additional sources

Rapid7 Sonar FDNS - no longer open

List of .FR domains from

Majestic Million

Internetstiftelsen Zone Data

DNS Census 2013

bigdatanews extract from Common Crawl (circa 2012)

Common Crawl - March/April 2020

The CAIDA UCSD IPv4 Routed /24 DNS Names Dataset - January/July 2019

GSA Data

OpenPageRank 10m hosts Open Data

Slovak domains - Open Data


This dataset can be used for research. There are papers that cover different topics. I’m just going to leave links to them here for reference.

