Hidden Treasures of TLDs: How I Scraped Hackernews for Domain Names
Background Story
As somebody who frequently starts new projects, I often need to think about project names and domain names.
Derek Sivers once posted about how to find a good (and free) .com domain, and I found it very inspiring. However, in some cases the project name is defined first, or you want to make a good play on words with the name and domain of the project. In those cases, you need a suitable TLD.
But finding such domain names is tricky. If you go through Wikipedia, you end up with more than 1.2k TLDs. (Trust me, I did it.)
So I needed to narrow it down. I did so by running the list through the filter of a bubble that a) seems relevant to me and b) is large enough: people who read and post on Hackernews. I had my Raspberry Pi scrape the Hackernews API for about three weeks (because of rate limits), and the results are what you find up there.
I ended up with a database of all HN stories since the very beginning, which amounted to ~1GB.
- There are currently 1283 relevant top-level domains.
- The historical graph, while fun to look at, does not offer much to take away for a new project, because some highly relevant domains (like .dev) were only released a few years ago.
- So I added an extra graph for the last full year (now: 2022). It has a logarithmic axis because of the vast difference between .com and the other domains.
- Original: The original generic top-level domains from the Internet’s early development predate ICANN’s creation in 1998.
- Country: Country code top-level domains (ccTLD)
- Generic: Generic top-level domains (gTLDs) - excluding geographic and brand gTLDs.
- Brand: brand gTLDs
- Geographic: geographic gTLDs
Not taken into account were the following TLDs:
- test top-level domains (tTLD)
- generic-restricted top-level domains (grTLD)
- Internationalized brand top-level domains
- Special-Use Domains
- Non-IANA domains
- Infrastructure top-level domains
URLs with one of the following patterns were ignored (they account for 0.001% in total):
- Wrong links such as http://blog.plover.com./prog/lib.html (they are negligible in the overall volume)
- URLs that point to IP addresses instead of domain names.
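The two exclusion rules above can be sketched in Go. This is only an illustration, not the project's actual code: the function name `skipURL` is hypothetical, and treating a trailing dot in the host as a "wrong link" is an assumption based on the example URL.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// skipURL reports whether a story URL should be left out of the statistics.
// Hypothetical helper: the rules mirror the two patterns listed above.
func skipURL(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return true // not parseable at all
	}
	host := u.Hostname() // host without port or IPv6 brackets
	if host == "" {
		return true // no host, so no TLD to count
	}
	if net.ParseIP(host) != nil {
		return true // the URL points to an IP address, not a domain name
	}
	// Assumption: a host ending in "." (as in blog.plover.com.) is a wrong link.
	return strings.HasSuffix(host, ".")
}

func main() {
	for _, raw := range []string{
		"http://blog.plover.com./prog/lib.html",
		"http://192.168.0.1:8080/index.html",
		"https://example.com/article",
	} {
		fmt.Println(raw, "skip:", skipURL(raw))
	}
}
```

Only the last of the three sample URLs passes the filter and would be counted.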
I wrote a small binary in Go to leverage its beautiful goroutines for fast scraping. But it turned out that there was a rate limit in place, so I had to limit the number of parallel goroutines to 4 (which is still better than 1, or than writing it in almost any other language).
The entries were inserted into a single-table MySQL DB. Why MySQL and not PostgreSQL? Just because I have used it for almost 20 years.
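To illustrate the idea of capping the scraper at four goroutines, here is a minimal worker-pool sketch. It is not the original scraper: the names `Item`, `fetchItem`, and `scrape` are hypothetical, and it assumes the public Hacker News Firebase endpoint `https://hacker-news.firebaseio.com/v0/item/<id>.json`. Error handling is reduced to skipping failed items.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"sync"
	"time"
)

// Item holds the two fields of an HN item this sketch cares about.
type Item struct {
	ID  int    `json:"id"`
	URL string `json:"url"`
}

var client = &http.Client{Timeout: 10 * time.Second}

// fetchItem downloads a single item (endpoint is an assumption, see above).
func fetchItem(id int) (Item, error) {
	var item Item
	resp, err := client.Get(fmt.Sprintf("https://hacker-news.firebaseio.com/v0/item/%d.json", id))
	if err != nil {
		return item, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return item, fmt.Errorf("item %d: unexpected status %s", id, resp.Status)
	}
	err = json.NewDecoder(resp.Body).Decode(&item)
	return item, err
}

// scrape fetches all ids with at most `workers` requests in flight.
// Results arrive in no particular order.
func scrape(ids []int, workers int, fetch func(int) (Item, error)) []Item {
	jobs := make(chan int)
	results := make(chan Item)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range jobs {
				item, err := fetch(id)
				if err != nil {
					continue // a real scraper would log and retry here
				}
				results <- item
			}
		}()
	}

	// Feed the job queue, then signal that no more work is coming.
	go func() {
		for _, id := range ids {
			jobs <- id
		}
		close(jobs)
	}()

	// Close the results channel once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	var items []Item
	for item := range results {
		items = append(items, item)
	}
	return items
}

func main() {
	ids := []int{1, 2, 3, 4, 5, 6, 7, 8}
	for _, item := range scrape(ids, 4, fetchItem) {
		fmt.Println(item.ID, item.URL)
	}
}
```

Passing the fetch function as a parameter keeps the pool independent of the network, so the concurrency limit can be tried out with a stub before pointing it at the real API.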
I created a JSON file from the database with another Go binary to ship it with the frontend. The most exciting part here is how I queried the data for specific years and how I substring-ed the URL column to group it by top-level domain. By the way: a huge shout-out to this Stackoverflow thread that taught me how to do it!
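The grouping itself happened in SQL, and that query is not reproduced here. As a rough illustration of the same idea in Go, the sketch below cuts the top-level domain out of each URL and counts stories per TLD for one year. The `Story` type, the function names, and the sample rows are all made up for the example, and IP-address hosts are assumed to have been filtered out beforehand.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
	"strings"
)

// Story is a hypothetical stand-in for one database row.
type Story struct {
	URL  string
	Year int
}

// tldOf returns the lowercased last label of the URL's host,
// or "" if the URL has no usable host (for example a trailing dot).
func tldOf(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return ""
	}
	host := strings.ToLower(u.Hostname())
	i := strings.LastIndex(host, ".")
	if i < 0 || i == len(host)-1 {
		return ""
	}
	return host[i+1:]
}

// countByTLD counts the stories of the given year per top-level domain.
func countByTLD(stories []Story, year int) map[string]int {
	counts := make(map[string]int)
	for _, s := range stories {
		if s.Year != year {
			continue
		}
		if tld := tldOf(s.URL); tld != "" {
			counts[tld]++
		}
	}
	return counts
}

func main() {
	// Made-up sample rows, not real data from the scrape.
	stories := []Story{
		{"https://example.com/post", 2022},
		{"https://go.dev/blog", 2022},
		{"https://example.org/a", 2021},
		{"https://web.dev/", 2022},
	}
	counts := countByTLD(stories, 2022)

	// Sort TLDs by descending count for a stable, readable listing.
	tlds := make([]string, 0, len(counts))
	for tld := range counts {
		tlds = append(tlds, tld)
	}
	sort.Slice(tlds, func(i, j int) bool { return counts[tlds[i]] > counts[tlds[j]] })
	for _, tld := range tlds {
		fmt.Println(tld, counts[tld])
	}
}
```

Doing this in the database avoids pulling every row into memory, which is presumably why the substring approach in SQL was attractive on a Raspberry Pi.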
The data gathering and transformation code could be better documented, but it is open source.