Finding Top FQDNs Per Day in DNSDB Export MTBL Files (Part One of a Three-Part Series)
I. Introduction
There’s always been a lot of popular interest in lists of “top” Internet domains. That interest has led to the publication of various lists such as:
- The Alexa 500 Top Sites (longer lists are available for a fee)
- The Cisco Umbrella One Million List
- The Majestic One Million List
- The Mozilla Top 500 List
- The OpenPage Rank (“Domcop”) List
- The Quantcast List of Top Websites
- The World Web Sites Hosting Information Directory from myip.ms.
Many of these lists are based on observed web traffic volume, but customers who have purchased Farsight DNSDB Export (DNSDB “On-Premise”) can build a similar list for their own private use based on the data seen by Farsight’s passive DNS sensor network.
II. Inherent Data Limitations
If you’re a DNSDB Export customer and are thinking about exploring count data, you need to understand the inherent limitations associated with the “count” data included in DNSDB. Recognize that the DNSDB “counts” seen for one domain will not be strictly comparable to the counts seen for another domain — the cache miss query counts will be strongly influenced by things like the TTLs of the respective domains. That is, a comparatively low-interest domain may appear to be getting “hit” more often than another domain simply because the first domain has set a short TTL while the second is using a longer TTL. (For example, a record with a 60-second TTL can drive up to 86,400/60 = 1,440 cache-miss queries per day from a single continuously busy recursive resolver, while an otherwise-identical record with a 3,600-second TTL would drive at most 24.)
Resolver configuration (at least downstream of Farsight sensor collection points) can also impact apparent query volume — imagine, for example, traffic from a site running a caching resolver that has a highly tailored cache management policy. One popular caching recursive resolver, Unbound, has a number of potentially relevant settings (a sample configuration sketch follows the list):
- cache-max-ttl (if set to an artificially low value, will increase the frequency with which requeries occur)
- cache-min-ttl (if set to an artificially high value, will decrease the frequency with which requeries occur)
- prefetch (if enabled, will increase the frequency of requeries)
- rrset-cache-size (if constrained, limits the potential caching that can be done)
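For illustration, here’s how a tailored configuration using those settings might look in a site’s unbound.conf — the option names are genuine Unbound settings, but the values shown are hypothetical, chosen purely to illustrate the sort of tuning that can skew apparent query volume:

server:
    # expire cached entries after at most 5 minutes, increasing requeries
    cache-max-ttl: 300
    # never cache an entry for less than 60 seconds, suppressing some requeries
    cache-min-ttl: 60
    # refresh popular entries before they expire (more upstream queries)
    prefetch: yes
    # a small RRset cache limits how much can be cached at all
    rrset-cache-size: 1m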
Nonetheless, the data can still be interesting, so we’ll show you how to compute this sort of “Top N”-type list using a small Perl script.
III. MTBL Files
DNSDB Export customers receive DNSDB content in the form of MTBL files. A typical site will have a combination of yearly, monthly, daily, hourly, decaminutely, and minutely DNS (and/or DNSSEC) MTBL files.
These files “roll up,” with minutely files getting combined into decaminutely files, decaminutely files getting combined into hourly files, hourly files getting combined into daily files, daily files getting combined into monthly files and monthly files getting combined into yearly files. As part of that rollup process, the counts for “matching” entries (where “matching” = same RRname, same RRtype, same bailiwick, and same RRdata) get summed up.
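For example (hypothetical entries, using the same JSON lines format shown in section IV), two “matching” decaminutely entries like these:

{"count":2,"time_first":1544216606,"time_last":1544216900,"rrname":"example.com.","rrtype":"A","bailiwick":"example.com.","rdata":["192.0.2.1"]}
{"count":3,"time_first":1544217100,"time_last":1544217400,"rrname":"example.com.","rrtype":"A","bailiwick":"example.com.","rdata":["192.0.2.1"]}

would roll up into a single entry with the counts summed and the observation window widened accordingly:

{"count":5,"time_first":1544216606,"time_last":1544217400,"rrname":"example.com.","rrtype":"A","bailiwick":"example.com.","rdata":["192.0.2.1"]}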
A typical DNSDB daily file includes a little over 500 million observations and runs around 12 GB.
IV. Extracting FQDNs and Counts From An MTBL File: The dnstable_dump Command
The first step toward building a “Top N” list from a DNSDB MTBL file is extraction of a list of RRnames and associated counts. We’ll use the dnstable_dump command to access the MTBL files, asking for our output to be formatted in JSON lines format:
$ dnstable_dump -r /path-to-mtbl-files/dns.20181208.D.mtbl -j > temp.txt
$ more temp.txt
[...]
{"count":1,"time_first":1544216606,"time_last":1544216606,"rrname":"questar.ac.","rrtype":"A","bailiwick":"questar.ac.","rdata":["211.123.214.24"]}
{"count":2,"time_first":1544216606,"time_last":1544216606,"rrname":"questar.ac.","rrtype":"NS","bailiwick":"ac.","rdata":["dns01.gmoserver.jp.","dns02.gmoserver.jp."]}
{"count":3,"time_first":1544216606,"time_last":1544236520,"rrname":"questar.ac.","rrtype":"NS","bailiwick":"questar.ac.","rdata":["dns01.gmoserver.jp.","dns02.gmoserver.jp."]}
{"count":1,"time_first":1544216607,"time_last":1544216607,"rrname":"questar.ac.","rrtype":"SOA","bailiwick":"questar.ac.","rdata":["dns01.gmoserver.jp. hostmaster.questar.ac. 2018101605 28800 7200 604800 600"]}
[etc]
V. Using jq To Extract Just The Fields We Need From The Full JSON Lines Output
That dump contains a lot more data than we need — remember that we just need the “RRname” and the “Count” value for each record. We’ll use jq to extract just the bits we actually want to keep:
$ dnstable_dump -r /path-to-mtbl-files/dns.20181208.D.mtbl -j |\
jq -r '"\(.rrname) \(.count)"' > /working-dir-path/names-and-counts.20181208.txt
Note the somewhat arcane quoting and escaping required for this jq extraction command to work… The required jq pattern, character by character, is:
'          single quote
"          double quote
\          backslash
(          open parenthesis
.rrname    first value we want to print, the RRname
)          close parenthesis
(space)    (could use some other delimiter here, such as a comma, instead)
\          backslash
(          open parenthesis
.count     second value we want to print, the associated count
)          close parenthesis
"          double quote
'          single quote
You’ll also want to be sure you run jq with the -r (“raw”) qualifier in order to get output sans double quote marks.
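As an aside, jq’s @tsv formatter offers an arguably easier-to-remember alternative to the quoted-and-escaped pattern above. The following is an equivalent formulation (we’ll stick with the original command for the rest of this article; note that @tsv emits tab-separated rather than space-separated output, which is fine since our downstream scripts split on any whitespace):

$ dnstable_dump -r /path-to-mtbl-files/dns.20181208.D.mtbl -j |\
jq -r '[.rrname, .count] | @tsv' > /working-dir-path/names-and-counts.20181208.txt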
The output file from that run will look like the following, consisting of a FQDN followed by an associated count:
[...]
questar.ac. 3
questar.ac. 1
www.questar.ac. 3
railgun.ac. 19
railgun.ac. 42
railgun.ac. 19
railgun.ac. 21
www.railgun.ac. 1
ralston.ac. 1
ralston.ac. 7
ralston.ac. 1
ralston.ac. 1
www.ralston.ac. 5
www.ralston.ac. 1
[etc]
Note that “the same” FQDN may be shown multiple times — that’s a reflection of the fact that a given FQDN may appear multiple times in the file due to differing record types, differing bailiwicks, or differing Rdata values (you can see this in the raw JSON format extract shown in section IV). The file is also not sorted (or at least isn’t sorted the way one would normally expect). Nonetheless, that file will usually be fine for our purposes, at least if we’re happy working with FQDNs.
VI. Working With Effective 2nd-Level Domains Instead of FQDNs
In some cases, however, you may not care about specific hostnames; you may just want to work with effective 2nd-level domains instead. That is, rather than worrying about specific hostnames (such as “www.example.com”, “webmail.example.com”, “printer14.example.com”, etc.), you might want to simplify all those names to just “example.com.” [In a simple world, we could just use the last two parts of each name (e.g., example.com). Unfortunately, because some domains get registered “further down the domain hierarchy” (e.g., in the UK, many domains get registered under co.uk, and in the K12 primary/secondary education world here in the US, many domains have been registered under k12.statename.us, etc.), we need to use the more nuanced approach shown here.]
We’ll assume that you’ve got data formatted as shown in section V. If that’s the case and you want to work with just effective 2nd-level domains, you can run your data through the little Perl filter shown below:
$ cat e2ld-col1-only
#!/usr/bin/perl
use strict;
use warnings;
use IO::Socket::SSL::PublicSuffix;
# load the Public Suffix List (see the installation notes below)
my $pslfile = '/usr/local/share/public_suffix_list.dat';
my $ps = IO::Socket::SSL::PublicSuffix->from_file($pslfile);
my $fqdn = '';
my $value = '';
while ( <STDIN> )
{
    my $line = $_;
    chomp($line);
    # each input line is a FQDN followed by its count
    ($fqdn, $value) = split(/\s+/, $line);
    # reduce the FQDN to its effective 2nd-level domain
    # (the public suffix plus one additional label)
    my $root_domain = $ps->public_suffix($fqdn,1);
    if ($root_domain)
    {
        printf( "%s\t%s\n", $root_domain, $value );
    }
}
Note that the above filter assumes:
- You’ve got GNU LibIDN installed (e.g., libidn11-dev). If not (or you’re not sure), try # apt-get install libidn11-dev (or the equivalent for your operating system)
- You’ve got Net::LibIDN installed. If not (or you’re not sure), try # cpanm Net::LibIDN to install it.
- You’ve got IO::Socket::SSL::PublicSuffix installed. If not (or you’re not sure), try # cpanm IO::Socket::SSL::PublicSuffix to install it.
- You’ve got a copy of the public_suffix_list.dat file downloaded from https://publicsuffix.org/ and installed in /usr/local/share/public_suffix_list.dat
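Pulling those prerequisites together, one possible installation sequence for a Debian-style system might look like the following (adjust package names and paths for your operating system; the wget URL shown is the Public Suffix List’s standard published data file location):

# apt-get install libidn11-dev
# cpanm Net::LibIDN IO::Socket::SSL::PublicSuffix
# wget https://publicsuffix.org/list/public_suffix_list.dat -O /usr/local/share/public_suffix_list.dat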
Finally, make sure you’ve made the script executable:
$ chmod a+rx e2ld-col1-only
Run that filter by saying:
$ ./e2ld-col1-only < names-and-counts.20181208.txt > 2ld-only-20181208.txt &
Please be patient; this will likely take a while on most systems (you’re converting 500 million+ domains).
VII. Getting To The Good Bit: Actually Tallying Hits of FQDNs or Effective 2nd-Level Domains in Perl
Imagine creating a table of FQDNs, with each FQDN having an associated counter that tallies the count for that FQDN. This is a natural fit for Perl’s “hash” data structure — at least as long as we can cram all the data to be tallied into available memory on the system being used. The Perl code to do this looks like:
$ cat top-test.pl
#!/usr/bin/perl
use strict;
use warnings;
my $line = '';
my %hash = ();
my $key;
my $value;
my $results = '/working-dir-path/names-and-counts.20181208.txt';
open (my $in8, "<", "$results") or die "$results: $!";
while ($line = <$in8>) {
# split out a pair of values (a FQDN and a count)
($key, $value) = split(/\s+/, $line);
# increment the existing hash if it exists; create it if not
if (defined ($hash{$key}))
{
$hash{$key} += "$value"+0;
}
else {
$hash{$key} = "$value";
}
} # while loop
Notes related to that code snippet:
- Be sure to set the correct hard-coded file path(s) in the script!
- Perl hashes don’t “auto-initialize” to zero, so we need to check whether a given hash element is already defined before attempting to add something to it (that’s why we need that “if (defined ($hash{$key}))” clause)
- Adding zero to the value ensures that Perl treats the value as a numeric result rather than as a string.
- If you try running this or similar code and the program just seems to die while running, check /var/log/kern.log (or the equivalent for your system) to see if you’re running into an out-of-memory situation. (If memory is a constraint, see the streaming alternative sketched just after these notes.)
- If you’d rather process the effective 2nd-level domain names, use 2ld-only-20181208.txt (the output file from the section VI filter) instead of names-and-counts.20181208.txt
- The above is just PARTIAL code — it loads the hash but doesn’t do anything with it once it’s loaded. We’ll talk about the rest of that in the next section (or check out Appendix I for a copy of the whole program)
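As mentioned in the notes above, if your system can’t comfortably hold the full hash in memory, the same tally can be computed “out of core” with standard Unix tools instead. The following pipeline is just a sketch (it assumes the two-column input format shown in section V, and that your sort can cope with a 500 million+ line file — GNU sort will spill to temporary files on disk as needed):

$ sort -k1,1 names-and-counts.20181208.txt |\
awk '$1 != prev { if (NR > 1) print sum, prev; prev = $1; sum = 0 } { sum += $2 } END { print sum, prev }' |\
sort -k1,1nr | head -200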
VIII. Sorting the Hashed Names by Count; Filtering Any Stuff We Don’t Care About; Displaying Just The Top Names
One nice thing about Perl hashes is that it’s pretty easy to sort a hash by its keys (or by the values associated with those keys). If we didn’t mind just getting a sorted list of “all the names” in the hash, we could simply add another little block of code to our partial script from Section VII:
my $j = 0;
# the sort comparator puts the largest counts first (descending numeric order)
foreach $key (sort {$hash{$b} <=> $hash{$a}} keys %hash)
{
    $j++;
    print "\t$j\t$hash{$key} $key\n";
}
If we then run the script from Section VII plus the above routine, we get results that look like:
1 453805828 com.
2 367717471 .
3 109973072 net.
4 66217141 fastly.net.
5 63201283 org.
6 60577676 in.
7 57877709 biz.
8 48031748 e.ns.nflxso.net.
9 48030026 f.ns.nflxso.net.
10 43177019 rs.
11 40955192 a.ns.fna.fbcdn.net.
12 40952435 b.ns.fna.fbcdn.net.
13 36392883 me.
14 35261662 pl.
15 34353603 ru.
[etc.]
If we were to print that entire table, it would contain millions of lines. We could look at that full set of results if we really wanted to, but in most cases we may only care about some “top hundreds” or perhaps some “top thousands” worth of results. For this article, we’ll limit what gets printed to something reasonable, perhaps 200 names.
We may also not care much about what I call “infrastructural FQDNs.” Infrastructural FQDNs can be thought of as FQDNs that unquestionably do important things, but which a typical end-user doesn’t directly visit. These include FQDNs such as:
- The root domain (“.”)
- 1st-level name server domains (such as com, net, org, etc.)
- “Effective 1st-level delegation points” (such as com.br, etc.)
- Name server FQDNs (such as e.ns.nflxso.net, etc.)
- Raw cloud and CDN-related FQDNs
- Advertising-related and/or user-tracking-related FQDNs
- Raw video-streaming-related FQDNs
- Anti-malware- and anti-spam-related FQDNs, etc.
Those domains are like the plumbing and wiring of a typical house — the services that the infrastructure enables are great, but we don’t tend to think much about them (or directly visit them). They’re not “end-user-focused domains.”
We’ll add some Perl code that will let us filter out the infrastructural domains, too. To keep our filters manageable, we’ll use separate files for each of those categories. All in all, we found ourselves filtering about 750 “infrastructural” FQDNs:
369 nsKill.txt
187 arpaKill.txt
120 cloudKill.txt
44 cdnKill.txt
10 antiKill.txt
7 videoKill.txt
6 advertisingKill.txt
4 2ldKill.txt
We’ll spare you the full contents of those files here (they’re just lists of infrastructural domains, one per line), but a brief illustrative excerpt follows.
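For example, a name-server kill file such as nsKill.txt might contain entries like the following (a hypothetical excerpt, assembled from the name server FQDNs visible in the unfiltered results shown above):

e.ns.nflxso.net.
f.ns.nflxso.net.
a.ns.fna.fbcdn.net.
b.ns.fna.fbcdn.net.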
Post filtering, we now get results that look like:
1 28126955 crunchyroll.com.
2 26632153 tumblr.com.
3 14736232 marriott.com.
4 12357527 mail.ru.
5 11482885 novline.ru.
6 9605338 mangosip.ru.
7 7968736 yf1.yahoo.com.
8 7967971 yf2.yahoo.com.
9 7191686 helpshift.com.
10 7121604 spotify.com.
11 6743730 xmidt.comcast.net.
12 6670091 app.link.
[etc.]
Additional filtering may still be required (e.g., users normally don’t go to yf{1,2}.yahoo.com, xmidt.comcast.net, etc.)
IX. Effective 2nd-Level Domains — An Easy Way To Spot DDoS Attack Traffic?
If we look at effective 2nd-level domains instead, we get results that look like:
1 16625692 sina.com.cn
2 11194820 telkom.net.id
3 6749075 aridns.net.au
4 6491813 mega.co.nz
5 5384014 univie.ac.at
6 4990641 ctbc.com.br
7 3440836 twnic.net.tw
8 3402660 laveeneld.k12.az.us
9 3293029 yahoo.co.jp
10 3236428 ynkmptt.net.cn
11 3147646 ukraine.com.ua
12 2918084 ccgslb.com.cn
13 2744877 uol.com.br
14 2684019 ct.googleapis.com
15 2658765 hotnet.net.il
16 2634318 gesd.k12.az.us
[etc]
A couple of those names stand out as being different from the others — these are a couple of K12 schools in Arizona:
8 3402660 laveeneld.k12.az.us
16 2634318 gesd.k12.az.us
Those seem pretty “busy” or “high traffic” for K12 sites… Spot-checking DNSDB, we actually DO see relevant hits of a corresponding magnitude for those names; this is not some processing artifact in my code:
$ dnsdbq -r \*.gesd.k12.az.us -S -k count | less
;; record times: 2010-06-29 21:38:44 .. 2018-12-12 00:23:58
;; count: 379181114; bailiwick: k12.az.us.
dns2.gesd.k12.az.us. A 12.159.65.138

;; record times: 2010-06-29 21:38:44 .. 2018-12-12 00:23:58
;; count: 379178658; bailiwick: k12.az.us.
gesd.k12.az.us. NS dns1.gesd.k12.az.us.
gesd.k12.az.us. NS dns2.gesd.k12.az.us.

;; record times: 2010-06-29 21:38:44 .. 2018-12-12 00:23:58
;; count: 379176834; bailiwick: k12.az.us.
dns1.gesd.k12.az.us. A 12.159.64.132
[etc]

$ dnsdbq -r \*.laveeneld.k12.az.us -S -k count | less
;; record times: 2010-06-25 08:12:08 .. 2018-12-11 23:50:22
;; count: 566317720; bailiwick: k12.az.us.
laveeneld.k12.az.us. NS dns1.laveeneld.k12.az.us.
laveeneld.k12.az.us. NS dns2.laveeneld.k12.az.us.

;; record times: 2010-06-25 08:12:08 .. 2018-12-11 23:50:22
;; count: 566317398; bailiwick: k12.az.us.
dns1.laveeneld.k12.az.us. A 63.229.61.147

;; record times: 2010-06-25 08:12:08 .. 2018-12-11 23:50:22
;; count: 566317363; bailiwick: k12.az.us.
dns2.laveeneld.k12.az.us. A 209.234.221.50
[etc]
Have we found a couple of sites that are being victimized/abused or is this just a matter of some misconfiguration associated with these domains that’s resulting in anomalous traffic? In Part 2 of this blog, we’ll graph traffic volume over time to see what we can figure out about this phenomenon.
X. Security Considerations
IMPORTANT: The little demo/proof-of-concept Perl scripts shown in this blog article are meant to be run only by trusted internal users; we’ve made no attempt to “sanitize” inputs or make them “bulletproof.” Under NO condition should these be exposed to input data from untrustworthy sources. The code is also just “demo grade,” not production ready.
XI. Legal Caveat
DNSDB Export files may only be used in ways consistent with applicable contractual terms. In particular, DNSDB Export files are normally sold by Farsight for the exclusive use of the DNSDB Export customer ONLY. This means that you may NOT publish or otherwise reshare a work based on DNSDB Export data (such as a list of top Internet sites) without prior contractual authorization and licensing from Farsight Security, Inc.
XII. Conclusion
You’ve now seen how DNSDB Export customers can extract information from MTBL files and use that data to compute “Top N”-type lists of fully qualified domain names (“hostnames”) or “Top N”-type lists of effective 2nd-level domain names.
We’ve also identified some potentially-interesting effective 2nd-level domains that may merit further scrutiny. In part two of this series, we’ll look at how DNSDB Export users can potentially graph traffic volume over time to get a sense of WHEN anomalous traffic began and ended.
If you’d like to talk with Farsight Security today about what DNSDB Export can do for you and your organization, please contact Farsight Security at [email protected].
Appendix I. Source code for top-test.pl
#!/usr/bin/perl
use strict;
use warnings;
my %hash = ();
my $key;
my $value;
my $line = '';
##### Load our hash with our input file
my $results = '/working-dir-path/names-and-counts.20181208.txt';
open (my $in8, "<", "$results")
or die "$results: $!";
while ($line = <$in8>) {
# split out a pair of values (a FQDN and a count)
($key, $value) = split(/\s+/, $line);
if (defined ($hash{$key}))
{
$hash{$key} += "$value"+0;
}
else
{
$hash{$key} = "$value";
}
} # while results loop
close ( $in8 );
##### now find the top N
open my $in, '<', '/working-dir-path/nsKill.txt' or die $!;
chomp ( my @nsExclusions = <$in> );
close ( $in );
my $nsExclusionPattern = join '|', @nsExclusions;
open my $in2, '<', '/working-dir-path/2ldKill.txt' or die $!;
chomp ( my @twoLdExclusions = <$in2> );
close ( $in2 );
my $twoLdExclusionPattern = join '|', @twoLdExclusions;
open my $in3, '<', '/working-dir-path/cdnKill.txt' or die $!;
chomp ( my @cdnExclusions = <$in3> );
close ( $in3 );
my $cdnExclusionPattern = join '|', @cdnExclusions;
open my $in4, '<', '/working-dir-path/cloudKill.txt' or die $!;
chomp ( my @cloudExclusions = <$in4> );
close ( $in4 );
my $cloudExclusionPattern = join '|', @cloudExclusions;
open my $in5, '<', '/working-dir-path/advertisingKill.txt' or die $!;
chomp ( my @advertisingExclusions = <$in5> );
close ( $in5 );
my $advertisingExclusionPattern = join '|', @advertisingExclusions;
open my $in6, '<', '/working-dir-path/antiKill.txt' or die $!;
chomp ( my @antiExclusions = <$in6> );
close ( $in6 );
my $antiExclusionPattern = join '|', @antiExclusions;
open my $in7, '<', '/working-dir-path/videoKill.txt' or die $!;
chomp ( my @videoExclusions = <$in7> );
close ( $in7 );
my $videoExclusionPattern = join '|', @videoExclusions;
my $N = 200;
my $j = 0;
my $dotcount = 0;
my $teststring = '';
foreach $key (sort {$hash{$b} <=> $hash{$a}} keys %hash) {
if ($j < $N)
{
# don't print bare TLDs ($dotcount must be > 1)
$teststring = $key;
$dotcount = ( $teststring =~ tr/.// );
if (($key ne '.') &&
($dotcount > 1) &&
(not($key =~ $nsExclusionPattern)) &&
(not($key =~ $twoLdExclusionPattern)) &&
(not($key =~ $cdnExclusionPattern)) &&
(not($key =~ $cloudExclusionPattern)) &&
(not($key =~ $advertisingExclusionPattern)) &&
(not($key =~ $antiExclusionPattern)) &&
(not($key =~ $videoExclusionPattern)))
{
$j++;
print "\t$j\t$hash{$key}\ $key\n";
} # if non-infrastructural
} # if close enough to the top
if ($j >= $N) { last; } # we've printed the top N entries; stop
} # for each
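Once you’ve adjusted the hard-coded paths in the script for your own site, make it executable and run it, capturing the top-200 listing in a file (the output filename here is arbitrary):

$ chmod a+rx top-test.pl
$ ./top-test.pl > top-200.20181208.txt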
Joe St Sauver Ph.D. is a Distinguished Scientist with Farsight Security, Inc.
Read the next part in this series: Volume-Over-Time Data From DNSDB Export MTBL Files (Part Two of Three-Part Series)