Domains Project: Processing petabytes of data so you don’t have to
World’s single largest Internet domains dataset
This public dataset contains a freely available, sorted list of Internet domains.
- 10 Million
- 20 Million
- 30 Million
- 50 Million
- 70 Million
- 100 Million
- 150 Million
- 200 Million
- 250 Million
- 300 Million
- 500 Million
- 750 Million
- 1 Billion
- 1.2 Billion
- 1.5 Billion
- 1.7 Billion
(Wasted) Internet traffic:
- More than 1TB of Internet traffic is just 3 Mbytes of compressed data
- 1 million domains is just 5 Mbytes compressed
- More than 1PB of Internet traffic is necessary to crawl 375 million domains (3.4TB / 1 million).
- Only 2.4 GB of disk space is required to store 375 million domains in compressed form
- A fully saturated 1 Gbit link is good for about 2 million new domains every day
- A machine with 8 cores/16 threads and 64 Gbytes of RAM is good for about 2 million new domains every day
- 2 ISC Bind9 instances (>400 Mbytes RSS each) are required to get 2 million new domains every day
- After reaching 9 million domains, the repository was switched to compressed files. Please use the freely available XZ to unpack them.
- After reaching 30 million records, files were moved to /data so the repository doesn’t have its README at the very bottom.
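The traffic figures in the list above can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope calculation: it assumes decimal units (1 PB = 1000 TB) and uses the 3.4 TB per million domains figure from the list. The function names are illustrative and not part of the project.

```python
# Back-of-the-envelope check of the traffic figures quoted in the list.
# Assumption: decimal units throughout (1 PB = 1000 TB, 1 TB = 10**12 bytes).

TB_PER_MILLION_DOMAINS = 3.4  # from the list: 3.4 TB of traffic per 1 million domains
DOMAINS_MILLIONS = 375        # dataset size quoted in the list


def crawl_traffic_pb(domains_millions: float,
                     tb_per_million: float = TB_PER_MILLION_DOMAINS) -> float:
    """Estimated crawl traffic in petabytes for a number of domains given in millions."""
    return domains_millions * tb_per_million / 1000


def link_capacity_tb_per_day(link_gbit_per_s: float) -> float:
    """Upper bound on data moved per day over a fully saturated link, in terabytes."""
    bytes_per_second = link_gbit_per_s * 1e9 / 8
    return bytes_per_second * 86_400 / 1e12


if __name__ == "__main__":
    print(f"{crawl_traffic_pb(DOMAINS_MILLIONS):.3f} PB")   # prints "1.275 PB"
    print(f"{link_capacity_tb_per_day(1):.1f} TB/day")      # prints "10.8 TB/day"
```

The first result matches the "more than 1PB" claim. The second is only a theoretical ceiling for a 1 Gbit link; the list's practical figure of about 2 million new domains per day also depends on CPU, RAM and DNS capacity.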
This repository employs Git LFS technology, so users have to use both git lfs and xz to retrieve data. The cloning procedure is as follows:
git clone https://github.com/tb0hdan/domains.git
cd domains
git lfs install
./unpack.sh
Getting unfiltered dataset
Raw data may be available at https://dataset.domainsproject.org, though using the GitHub repo is recommended.
wget -m https://dataset.domainsproject.org
After unpacking, domain lists are just text files (~7.9 GB at 375 million) with one domain per line.
1tv.af
1tvnews.af
3rdeye.af
8am.af
aan.af
acaa.gov.af
acb.af
acbr.gov.af
acci.org.af
ach.af
acku.edu.af
acsf.af
adras.af
aeiti.af
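Because the lists are plain text with one domain per line, they can also be read straight from the compressed files without unpacking them first. The following is a minimal sketch using Python's standard lzma module. The helper names and any file path passed to them are hypothetical, not part of the repository.

```python
import lzma
from collections import Counter
from typing import Iterator


def iter_domains(path: str) -> Iterator[str]:
    """Yield one domain per non-empty line from an XZ-compressed text file."""
    # lzma.open in text mode decompresses on the fly, so the uncompressed
    # list never has to be written to disk or held in memory all at once.
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            domain = line.strip()
            if domain:
                yield domain


def count_by_tld(path: str) -> Counter:
    """Count domains by their last label (e.g. 'af' for '1tv.af')."""
    counts: Counter = Counter()
    for domain in iter_domains(path):
        counts[domain.rsplit(".", 1)[-1]] += 1
    return counts
```

For example, `count_by_tld("some-list.txt.xz").most_common(10)` would list the ten most frequent top-level labels in a file, where `some-list.txt.xz` stands in for whichever compressed list you downloaded.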
Search engines and crawlers
Domains Project bot
Domains Project uses a crawler and DNS checks to get new domains.
The DNS checks client is in its early stages and is used by a select few. It is called Freya, and I’m working on making it stable and good enough for the general public.
A typical user agent for the Domains Project bot looks like this:
Mozilla/5.0 (compatible; Domains Project/1.0.8; +https://domainsproject.org)
Some older versions point to the GitHub repo instead:
Mozilla/5.0 (compatible; Domains Project/1.0.4; +https://github.com/tb0hdan/domains)
Starting with version 1.0.7, the crawler has partial and rate limiting. Please open an issue if you experience any problems. Don’t forget to include your domain.
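Site operators who want to recognize the bot in their access logs could match the user agent strings quoted above. This is a rough sketch only: the regular expression is modelled on the two examples in this README, other variants may exist, and the function names are made up for illustration.

```python
import re
from typing import Optional, Tuple

# Pattern modelled on the two user agent strings quoted in this README.
# Assumption: the version is always three dot-separated integers.
_UA_PATTERN = re.compile(
    r"\(compatible; Domains Project/(\d+)\.(\d+)\.(\d+); \+(\S+)\)"
)


def parse_domains_project_ua(
    user_agent: str,
) -> Optional[Tuple[Tuple[int, int, int], str]]:
    """Return ((major, minor, patch), info_url), or None if the string does not match."""
    match = _UA_PATTERN.search(user_agent)
    if match is None:
        return None
    major, minor, patch, url = match.groups()
    return (int(major), int(minor), int(patch)), url


def has_rate_limiting(user_agent: str) -> bool:
    """True if the user agent reports version 1.0.7 or later, per the note above."""
    parsed = parse_domains_project_ua(user_agent)
    return parsed is not None and parsed[0] >= (1, 0, 7)
```

A non-match should be treated as "unknown" rather than proof that a request did not come from the bot, since user agent strings are easy to change or spoof.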
YaCy is a great open-source search engine. Here’s my post on the YaCy forum: https://searchlab.eu/t/domain-list-for-easier-search-bootstrapping/231
This dataset can be used for research. There are papers that cover different topics. I’m just going to leave links to them here for reference.