Hidden Treasures of TLDs: How I Scraped Hackernews for Domain Names


Background Story

As somebody who often starts new projects, I often need to think about project names and domain names.

Derek Sivers once posted about how to find a good (and free) .com Domain, and I found it very inspiring. However, in some cases, there is the project name defined first - or you want to do a good play of words with the name and domain of the project. In those cases, you need to have a suitable TLD.

But finding such domain names is a tricky thing. If you go through Wikipedia, you end up with more than 1.2k TLDs. (Trust me, I did it).

So I needed to narrow it down. And I did so by running it through the filter of a bubble that a) seems relevant to me and b) was large enough: People who read and post on Hackernews. So I had my Raspberry Pi scraping the Hackernews API for about 3 Weeks (because of rate limits), and the results you find up there.

I had a database full of HN Stories since the very beginning, which accumulated to ~1GB.

Takeaways

  • There are currently 1283 relevant top-level domains.
  • The historical graph, while funny to look at, has not had that much to take away for a new project because some super relevant domains were just released some years ago (like .dev)
  • So I added an extra graph for the last full year (now: 2022). It has a logarithmic axe for the vast difference between .com and other domains.

Method

Filter Settings:

  • Original: The original generic top-level domains from the Internet’s early development predate ICANN’s creation in 1998.
  • Country: Country code top-level domains (ccTLD)
  • Generic: Generic top-level domains (gTLDs) - excluding geographic and brand gTLDs.
  • Brand: brand gTLDs
  • Geographic: geographic gTLDs

Not taken into account were the following TLDs:

URLs with one of the following patterns were ignored (for a full account of 0.001%):

Code

I wrote a small binary in Go to leverage the beautiful go routines for fast scraping. But it turned out that there was a rate limit in place, and I needed to limit the parallel routines to 4. (which is better than 1 - or writing it in almost any other language)

The entries were inserted in a single table MySQL DB. Why MySQL and not PostgreSQL? Just because I have used it for almost 20 years.

I created a JSON file from the database with another Go binary to ship it with the Frontend. The most exciting part here is how I queried the data for specific years and how I substring-ed the column of URLs to group it by the top-level domain. By the way: A huge shout-out to this Stackoverflow thread that teaches me how to do it!

In the Frontend, it is vanilla Javascript leveraging Plotly. I implemented this into a shortcode that I added to my Hugo template.

The data gathering and transformation code must be better documented, but it is undoubtedly open source.