
“… He shall separate them one from another, as the shepherd separateth the sheep from the goats…” Matthew 25:32 (Douay-Rheims Edition)
We can split the domain names we see in passive DNS traffic according to many different taxonomies. For example, we can talk about:
Today we’re going to talk about a different way to categorize domain names:
$ dnsdbq -r 0vkcr4bccrjhivmsodse7h90d9ns9851.com/RRSIG -A1d -l 1
;; record times: 2019-03-16 05:19:46 .. 2019-03-20 05:20:42
;; count: 6593; bailiwick: com.
0vkcr4bccrjhivmsodse7h90d9ns9851.com. RRSIG NSEC3 8 2 86400 1553318330 1552709330 16883 com. fCRQh2uMY29qtm75l5t3lTjnTLMxjP2ptz2OX6oZImHW+hkKmJvK4B+9 pSZHtudReziEUmkJnUKl6ZLLe07nSTXRxcNvkjM5rImW7eQH9rkpv2cd I0ajdLESUrkisB4zWrBJvDWXKyR1SGTtdftXcInPHn23Exd/flaR1WcK 6hk=
While that’s a “real” RRSIG record, you will not find a corresponding “A”/”AAAA”/”CNAME” record in passive DNS, nor anything in Whois. We’re not interested in hashed NSEC3/RRSIG names today, either.
scuba-snowshovel-saute-snapdraggon[dot]com
When it comes to identifying algorithmic domain names, most analysts “know them when they see them.”
That’s great if you’re just scanning a short list of domains, but doesn’t scale well when confronting lists of hundreds of thousands or millions of candidate domain names. A detection algorithm is needed so the process can be automated. Random-looking domains will normally NOT include recognizable English words. Therefore, the core of one heuristic for spotting algorithmic domain names would be to look for domain names that contain zero embedded English words.
In using that rule-of-thumb, obviously we’ll be disregarding other conceivable types of algorithmically-constructed domain names (such as our
scuba-snowshovel-saute-snapdraggon[dot]com
domain). That sort of domain name, while algorithmic, would obviously fail to be detected by our “no English words in the domain name” heuristic. We could create an alternative heuristic that WOULD recognize domain names created by concatenating English words together, but for now we’re just going to focus on random-looking algorithmic domains that don’t contain English words, and ignore other styles of algorithmic domain name construction.
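To make that core heuristic concrete, here is a minimal sketch of the “zero embedded English words” test: slide a window across the 2nd-level label and check whether any substring of three or more characters appears in a word list. (This is purely an illustration with a toy word list; the full program we actually use, with whitelisting and TLD handling, appears in Appendix I.)

# Minimal sketch of the "no embedded English words" heuristic.
# 'words' is assumed to be a set of lowercase English words (e.g., from SCOWL).
def has_embedded_word(label, words, min_len=3):
    label = label.lower()
    for i in range(len(label)):
        for j in range(i + min_len, len(label) + 1):
            if label[i:j] in words:
                return True
    return False

words = {"scuba", "snow", "shovel", "saute"}          # toy word list for illustration
print(has_embedded_word("pldqwnnuamwkrpxk", words))   # False -> looks algorithmic
print(has_embedded_word("scubasnowshovel", words))    # True  -> contains real words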
We can refine our core heuristic further. For example:
We’ll explore some of these areas later in this article.
We also want to be sure to acknowledge up front that there has been significant formal academic work around algorithms for detecting domain generation algorithms (“DGAs”). It is not our intention to try to “compete” with that formal work, nor to tie the names we find to particular DGAs.
Rather, our goal today is just to demonstrate a pragmatic approach to finding algorithmic-looking domain names in “real-life traffic” as seen on Farsight Security SIE Channel 204.
In order to identify embedded words, we’ll need a “vocabulary list” of English words. One source for English words is SCOWL, the Spell Checker Oriented Word Lists and Friends. We’ll use the web interface to actually create our list:

Figure 1. Settings For Our Downloaded List of Words
After downloading the list we’ve just created, we’ll untar it and check to see how many words it contains:
$ tar xfv SCOWL-wl.tar.gz
$ cd SCOWL-wl
$ wc -l words.txt
123601 words.txt
Reviewing the words in the words.txt file, we immediately noticed that it contained words with apostrophes and that some of the words (e.g., proper nouns) were mixed case. Since domain names can’t contain apostrophes, we dropped all words with apostrophes. And because domain names are case insensitive, we also forced all the words to be lowercase only, thereby simplifying later comparisons. After sorting and uniquifying what remained, our list was reduced to roughly 88,000 words:
$ grep -v "'" words.txt > words2.txt <-- double quote, tick mark, double quote
$ tr '[:upper:]' '[:lower:]' < words2.txt | sort -u > words.txt
$ wc -l words.txt
87881 words.txt
For comparison, it has been reported that a typical 20-year-old American knows about 42,000 words, so we’d expect this list to encompass most commonly used English words (and then some). If that list is too inclusive (or not inclusive enough), you can obviously tweak the “SCOWL size” variable in the web form to get a larger or smaller list to suit your preferences. We might also want to supplement that list with a list of common Internet/computing/networking-related “jargon” terms, and a list of common abbreviations and acronyms.
In addition to a list of English words, we also need a program to actually look for words in text. This article provides a nice conceptual starting point, although we’ll need to modify that algorithm for our unique requirements. To meet our specific needs, our Python program will need to:
What domains might be “out of scope” and “subject to filtering?” Well, we’ll:
ip6.arpa or in-addr.arpa
00112233445566778899[dot]tk)
For example, after processing, the words found in the domain name bellsouthyahoo[dot]com were:
9, 14, bellsouthyahoo, com, bell, bells, ell, ells, out, sou, south, thy, yahoo
The bellsouthyahoo[dot]com name does NOT look like a “random” algorithmic domain, even though it is 14 characters long, because 9 identifiable words were extracted from the 2nd-level label.
Contrast that with a name that DOES look random, in which NO identifiable words were found, such as:
0, 34, aa625d84f1587749c1ab011d6f269f7d64, com
You can see the code we came up with for this in Appendix I.
As much as we want to find algorithmic domain names that lack English words, we also want to carefully avoid any popular “real” domain names. Avoiding those will give us insurance against most potential “false positives”:
Figure 2. False Positives and False Negatives
Well-defined “popular” domains are ones that are on any of a number of “top N” lists such as the Alexa One Million Top Domains, the Majestic Top One Million domains, or the Domcop Top 10 Million domains.
The article, “Comparison Of Alexa, Majestic & Domcop Top Million Sites” does a nice job of laying out how to retrieve and extract domains from those lists.
To prepare a consolidated white list from those raw lists, we’ll:
This combined list has 8,650,286 domains which we’ll define as “exempt” from being tagged as “random-looking” algorithmic domain names. Does this leave us open to potentially missing a true “random-looking” algorithmic domain name that’s somehow also “popular?” Yes. There are definitely some pretty “random-looking” domains in that combined list, but we believe the value of protecting against accidentally mis-including popular domains outweighs the risk of having “false positives.”
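The per-list extraction details vary (each “top N” file has its own format), but once each source has been reduced to one bare domain per line, the consolidation itself is simple. Here is a sketch, assuming hypothetical per-source files alexa.txt, majestic.txt, and domcop.txt; the output filename matches the combined-millions-unique.txt whitelist read by the code in the appendices:

# Sketch: merge per-source domain lists into one deduplicated whitelist.
# alexa.txt, majestic.txt, and domcop.txt are hypothetical filenames for lists
# already reduced to one bare domain name per line.
combined = set()
for filename in ("alexa.txt", "majestic.txt", "domcop.txt"):
    with open(filename) as f:
        for line in f:
            domain = line.strip().lower().rstrip(".")
            if domain:
                combined.add(domain)

with open("combined-millions-unique.txt", "w") as out:
    for domain in sorted(combined):
        out.write(domain + "\n")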
Having developed code to find at least one class of algorithmic domains, we now need to test our code and “whitelist” on some real data (our “test corpus”).
To build a test corpus for initial assessment, we logged into a blade server at the Security Information Exchange (SIE) which has access to Channel 204. We then pulled ten million RRnames and RRtypes with the command:
$ time nmsgtool -C ch204 -c 10000000 -J - | jq --unbuffered -r '"\(.message.rrname) \(.message.rrtype)"' > ten-million-domains-to-test.txt
Because Channel 204 is a fairly busy channel, it only took a little over five minutes to pull that sample — the time command reported:
real 5m11.123s
user 6m4.932s
sys 0m29.812s
The output from that command consisted of a list of fully qualified domain names and associated record types:
$ more ten-million-domains-to-test.txt
phaodaihotel.com[dot]vn. A
cleancarpetandduct[dot]com. SOA
phaodaihotel.com[dot]vn. NS
cleancarpetandduct[dot]com. NS
imap.klodawaparafia[dot]org. CNAME
app[dot]link. AAAA
villamarisolpego[dot]com. NS
www.synapseresults.com.cdn.cloudflare.net. A
roadvantage[dot]com. SOA
plefbn59gl6bvbdp21haguljb51mbp7c[dot]nl. NSEC3
i6pangjmmi77sn4bt560edcnft2877a7[dot]by. NSEC3
i6pangjmmi77sn4bt560edcnft2877a7[dot]by. RRSIG
plefbn59gl6bvbdp21haguljb51mbp7c[dot]nl. RRSIG
gamamew-org[dot]cf. NS
egauger[dot]com. NS
[etc]
Note that some of the “random looking” domain names in that brief sample are actually hashed DNSSEC-related records (NSEC3 and RRSIG record types) as mentioned in Section I. We’ll exclude NSEC3 records and RRSIG records by saying:
$ grep -v " NSEC3" ten-million-domains-to-test.txt | grep -v " RRSIG" > ten-million-domains-to-test-without-dnssec.txt
$ wc -l ten-million-domains-to-test-without-dnssec.txt
8573184 ten-million-domains-to-test-without-dnssec.txt
For the curious, there were 435,709 NSEC3 records and 991,107 RRSIG records excluded by that filtering.
We’re also going to exclude internationalized domain names (“IDNs,” which have Punycoded names beginning with xn--). We’ll also drop names that end in .arpa. (this last filter is meant to catch any ip6.arpa. or in-addr.arpa. entries). We can also drop the record type data column, since we’ll no longer need it to filter NSEC3 or RRSIG records:
$ grep -v "xn--" ten-million-domains-to-test-without-dnssec.txt | grep -v "\.arpa\. " | awk '{print $1}' > ten-million-domains-to-test-without-dnssec-idn-or-arpas.txt
$ wc -l ten-million-domains-to-test-without-dnssec-idn-or-arpas.txt
8032040 ten-million-domains-to-test-without-dnssec-idn-or-arpas.txt
There were 41,101 internationalized domain names and 500,043 names that ended in .arpa.
We’ll then reduce the remaining domains to only their delegation points (for example, www.example.com -> example.com) with a popular little Perl script called 2nd-level-dom-large that reads and processes those domains record-by-record (see Appendix II):
$ 2nd-level-dom-large < ten-million-domains-to-test-without-dnssec-idn-or-arpas.txt > ten-million-domains-to-test-2nd-level.txt
Having dumped the hostname part of the fully qualified domain names, we may now potentially have duplicate entries. We’ll deduplicate those domains by sorting and uniq’ing them:
$ sort ten-million-domains-to-test-2nd-level.txt > ten-million-domains-to-test-2nd-level-sorted.txt
$ uniq < ten-million-domains-to-test-2nd-level-sorted.txt > sorted-and-uniqed-ten-million.txt
$ wc -l sorted-and-uniqed-ten-million.txt
3566037 sorted-and-uniqed-ten-million.txt
We were then ready to test our find-a-word procedure. The Python code in Appendix I has been set to only test for English words that are three or more characters in length. Some additional filtering is also done; see the find-words.py code. This takes our effective number of domains for analysis down to just 3,131,227 domains:
$ ./find-words.py < sorted-and-uniqed-ten-million.txt > ten-million-output.txt
$ gsort -t',' -k1n -k2nr < ten-million-output.txt > sorted-ten-million-output.txt
$ uniq sorted-ten-million-output.txt > sorted-and-uniqed-ten-million-output.txt
$ wc -l sorted-and-uniqed-ten-million-output.txt
3131227 sorted-and-uniqed-ten-million-output.txt
The distribution of 3,131,227 2nd-label lengths (measuring the lengths in characters of just the effective 2nd-labels) looks like:

Figure 3. Distribution of 2nd-Label Lengths
A copy of the R program used to create that graph can be found in Appendix III.
The distribution of the number of words found per effective 2nd-label is a nice measure of the inclusivity of our word list. If our word list does a good job of capturing all the words that people use when creating domain names, we should see relatively few “zero words found” domain names except for truly random-looking junk. We’d also expect to find a relatively large number of embedded words in at least some names.

Figure 4. Number of Embedded Words Found in Effective 2nd-level labels
Here are some selected tabular “cut points” from the sorted 3,131,227-line output file from our “English words-only” run:
# of Words   Starting     Cumulative %        Words In        % of
Found        On Line      of Complete List    This "Slice"    Complete List
 0                   1       0%                  406,841       12.99%
 1             406,842      12.99%               286,664        9.16%
 2             693,506      22.15%               297,230        9.49%
 3             990,736      31.64%               307,051        9.81%
 4           1,297,787      41.45%               293,268        9.37%
 5           1,591,055      50.81%               270,360        8.63%
 6           1,861,415      59.45%               238,653        7.62%
 7           2,100,068      67.07%               206,452        6.59%
 8           2,306,520      73.66%               173,525        5.54%
 9           2,480,045      79.20%               145,115        4.63%
10           2,625,160      83.84%               384,083       12.26%
15           3,009,243      96.10%                98,280        3.13%
20           3,107,523      99.24%                19,508        0.62%
25           3,127,031      99.87%                 3,522        0.11%
30           3,130,553      99.98%                   656        0.02%
40           3,131,209      99.999%                   18        0%
47 (max)     3,131,227     100.00%
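For reference, a tally like the one underlying that table can be produced directly from the sorted output file, since each line begins with the number of words found. A minimal sketch, assuming the comma-separated output format shown above:

# Sketch: tally how many domains had each "number of words found" value,
# reading lines of the form "N, length, label, tld, word, word, ..." on stdin.
import collections
import sys

counts = collections.Counter()
for line in sys.stdin:
    first_field = line.split(",", 1)[0].strip()
    if first_field.isdigit():
        counts[int(first_field)] += 1

for words_found in sorted(counts):
    print(words_found, counts[words_found])

Feeding sorted-and-uniqed-ten-million-output.txt to that sketch on stdin yields one “words found, number of domains” pair per line, from which the cumulative figures above can be derived.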
The largest number of “embedded words” found was 47, for the vaguely-threatening-sounding second-label string howdoesdonaldtrumpsaythesethingsandnotgethospitalized:
47, 53, howdoesdonaldtrumpsaythesethingsandnotgethospitalized, com, ali, and, doe, does, don, dona, donald, ese, ethos, get, gsa, hes, hing, hings, hos, hosp, hospital, hospitalize, hospitalized, how, ing, ital, liz, not, pit, pita, rum, rump, rumps, san, sand, say, set, seth, spit, tali, the, these, thin, thing, things, tho, trump, trumps, ump, umps, zed
Digging into the 3,131,227 observations in sorted-and-uniqed-ten-million-output.txt, we can see that our code successfully identified many zero-embedded-English-words algorithmic-looking names, including:
0, 20, 3194df72cec99745647d, date
0, 20, 4cb619a293b854c7269a, date
0, 20, 58e20e431fd16ecea0cc, date
0, 20, 5cac157ee96e87a7bb59, date
0, 20, 605b4274b01e6840e84b, date
0, 20, 65a0163dddfbd7030f48, date
0, 20, 65v1d7rixqtu6ixg65hm, ws
0, 20, 6d69c6ff6eb7e8facf18, date
0, 20, 88f0289c6f7e1f0b3cf9, date
0, 20, 9111304db6c05e5473a8, date
0, 20, 99990002188338d9dc5e, date
0, 20, 9ee08f9b1e36e1a3f880, date
0, 20, ab032283869929420167, win
0, 20, ab032302933987702253, win
Interpreting the small set of observations just mentioned:
We can use dnsdbq to confirm that those domains really DO exist in DNSDB. For example, asking to see just a single “A” answer (RRset) for 3194df72cec99745647d[dot]date:
$ dnsdbq -r 3194df72cec99745647d[dot]date/A -l 1
;; record times: 2018-03-08 00:40:40 .. 2019-03-06 13:09:49
;; count: 425; bailiwick: 3194df72cec99745647d[dot]date.
3194df72cec99745647d[dot]date. A 104.28.6.141
3194df72cec99745647d[dot]date. A 104.28.7.141
That random-looking domain happens to be currently hosted on/through Cloudflare.
a) “Raw” TLDs: DNSDB data can include names with just one label (e.g., a TLD), names with two labels (a normal delegation point), or names with three or more labels (typically “fully-qualified domain names”).
We saw 850 raw effective-top-level domains. These can readily be identified by noting the “0, 0, ,” start to each of the relevant lines (e.g., the second-level part was missing/of “zero length,” and obviously we won’t find any embedded words in a zero-length second-level label).
[...]
0, 0, , asia
0, 0, , asn.au
0, 0, , associates
0, 0, , at
0, 0, , attorney
0, 0, , auction
0, 0, , audio
0, 0, , auto
0, 0, , avianca
0, 0, , aw
0, 0, , ax
0, 0, , az
0, 0, , azure
[etc]
b) Domains With Extremely LOW Shannon Entropy Values:
While most of the “zero words found” domains truly did look random, some “zero-words-found” domains nonetheless had NON-random patterns readily discernible to a human analyst, such as runs of a single character. These labels have low Shannon entropy values. To get a handle on this, we modified our Python code to compute the Shannon entropy of the 2nd-level label and prepend it to each output line (see Appendix IV). For example:
[Shannon Entropy, Words Found, second-level label length, 2nd-level label, TLD]
0, 0, 7, jjjjjjj, cn
0.6500224216483541, 0, 6, 55555f, com
0.8112781244591328, 0, 8, z666z666, com
0.9709505944546686, 0, 5, ggvvg, cn
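As a quick cross-check of those values, Shannon entropy can be computed directly from a label’s per-character frequencies; the following standalone sketch mirrors the entropy() routine in Appendix IV:

# Shannon entropy of a string, computed from per-character frequencies.
import math

def shannon_entropy(s):
    n = len(s)
    counts = {c: s.count(c) for c in set(s)}
    total = sum((k / n) * math.log2(k / n) for k in counts.values())
    return -total if total else 0.0

for label in ("jjjjjjj", "55555f", "z666z666", "ggvvg"):
    print(label, round(shannon_entropy(label), 3))
# prints: jjjjjjj 0.0 / 55555f 0.65 / z666z666 0.811 / ggvvg 0.971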
Shannon entropy values less than one are quite uncommon for domain second-level labels, at least based on our sample:

Figure 5. Shannon Entropy Distribution for the 2nd-Label of the Test Corpus
c) Domains with “Dashed” Names:
We also noticed some domains with large numbers of dashes, such as:
2.766173466363126, 0, 62, 1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-24, com
1.191501106916491, 0, 54, 0-0-0-0-0-0-0-0-0-0-0-0-0-33-0-0-0-0-0-0-0-0-0-0-0-0-0, info
1.228538143953528, 0, 54, 0-0-0-0-0-0-0-0-0-0-0-0-0-53-0-0-0-0-0-0-0-0-0-0-0-0-0, info
2.6081816167087406, 0, 23, h-e-r-n-a-n-c-o-r-t-e-s, com
2.695138138447871, 0, 23, f-i-v-e-d-i-a-m-o-n-d-s, com
2.5358577759182794, 0, 19, t-i-p-o-g-r-a-f-i-a, ru
2.641120933813016, 0, 19, m-y-w-e-b-p-o-k-e-r, eu
2.395998870534841, 0, 17, p-o-r-t-f-o-l-i-o, ru
2.5580510765444573, 0, 17, b-o-o-m-e-r-a-n-g, com
2.5580510765444573, 0, 17, p-r-e-j-u-d-i-c-e, com
1.9237949406953985, 0, 16, i--n--s--u--r--e, com
2.4956029237290123, 0, 14, ip-167-114-114, net
2.495602923729013, 0, 14, ip-144-217-111, net
2.495602923729013, 0, 14, ip-167-114-116, net
2.556656707462823, 0, 14, ip-144-217-114, net
2.610577243331642, 0, 14, ip-217-182-228, eu
2.6384600665861555, 0, 14, ip-167-114-119, net
2.6384600665861564, 0, 14, ip-149-202-222, eu
These patterns suggest that there might be filtering value in heuristics that tally the number of dashes in a domain name, or in stripping dashes as a preliminary pre-processing step.
Checking the data, and exempting names exhibiting a “potentially-normal” number of dashes (e.g., zero through three dashes) from analysis, we see a distribution for the remaining names that looks like:

Figure 6. Distribution of Dashes Found in Effective Second-Labels (excluding names with 0-3 dashes)
Given the existence of domains with this sort of atypical dash usage, one could envision a pre-processing rule that strips all dashes prior to searching the transformed name for words, although we’re not doing so at this time.
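For completeness, here is what such dash-oriented pre-processing might look like; this is only a sketch of the idea, not something used in the code in the appendices:

# Sketch: dash-related pre-processing. count_dashes() tallies dashes in a
# 2nd-level label (the final code drops names with four or more dashes);
# strip_dashes() would let the word search see "fivediamonds" rather than
# "f-i-v-e-d-i-a-m-o-n-d-s".
def count_dashes(label):
    return label.count("-")

def strip_dashes(label):
    return label.replace("-", "")

print(count_dashes("f-i-v-e-d-i-a-m-o-n-d-s"))   # 11
print(strip_dashes("f-i-v-e-d-i-a-m-o-n-d-s"))   # fivediamonds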
d) Non-English Words:
In other cases, inspection of the output makes it clear that if we’d included dictionaries for languages other than English, we would have detected foreign-language words. For example:
0, 23, trikolor-tv-elektrougli, ru
0, 20, muehlheim-aerztehaus, de
0, 16, ce-vogfr-aulnoye, fr
0, 15, buetler-elektro, ch
0, 15, cipriettigiulia, it
In retrospect, our initial focus on only English words was perhaps inappropriately provincial. The Internet is international, and has users who rely on French, German, Italian, Portuguese, Russian, Spanish, and every other language you can imagine.
What lists of foreign words can we find? First, note that we only want Romanized/Latin word lists (no Arabic script word lists, no Chinese/Japanese/Hangul script word lists, no Cyrillic word lists, etc.).
One potentially-rich source of Romanized foreign-language words is the word lists compiled for auditing password practices with password-cracking software. See, for example, the 20+ foreign language word lists here. We’ll assume that you’ve downloaded and merged those word lists, removing any words that include characters other than letters/digits/hyphens from the combined list.
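That merge-and-cleanup step can be scripted in a few lines. Here is a sketch, assuming the downloaded lists have been unpacked into a hypothetical wordlists/ directory (one word per line); the output filename matches the combined-word-list.txt file read by the Appendix IV code:

# Sketch: build combined-word-list.txt from the English word list plus the
# downloaded foreign-language lists. wordlists/ is a hypothetical directory
# holding the unpacked per-language files, one word per line.
import glob
import re

keep = re.compile(r"^[a-z0-9-]+$")   # letters/digits/hyphens only
combined = set()

for filename in ["words.txt"] + glob.glob("wordlists/*"):
    with open(filename, encoding="utf-8", errors="ignore") as f:
        for line in f:
            word = line.strip().lower()
            if keep.match(word):
                combined.add(word)

with open("combined-word-list.txt", "w") as out:
    for word in sorted(combined):
        out.write(word + "\n")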
After adding the new word lists and rerunning our find-words.py script with our new combined English-plus-foreign word list, we find far more embedded words, as reflected in FEWER domains with just a small number of words found.
For example, as shown in the following table, when we had only English words to look for in domain names, there were 406,841 domains in which we found no words. After changing the word list to a combination of English AND foreign-language words, we’re down to only 94,011 “domains with no embedded words,” aka random-looking algorithmic names.
              ENGLISH WORDS ONLY            COMBINED LANGUAGE LIST
Words       Starting On   % of Lines      Starting On   % of Lines    DIFFERENCE
Found
  0                   1       0%                    1       0%
  1             406,842      12.99%            94,012       3.00%        9.99%
  2             693,506      22.15%           213,816       6.83%       15.32%
  3             990,736      31.64%           329,519      10.52%       21.12%
  4           1,297,787      41.45%           455,640      14.55%       26.90%
  5           1,591,055      50.81%           566,168      18.08%       32.73%
  6           1,861,415      59.45%           677,469      21.64%       37.81%
  7           2,100,068      67.07%           807,436      25.79%       41.28%
  8           2,306,520      73.66%           929,783      29.69%       43.97%
  9           2,480,045      79.20%         1,062,576      33.93%       45.27%
 10           2,625,160      83.84%         1,196,525      38.21%       45.63%
 15           3,009,243      96.10%         1,822,685      58.21%       37.89%
 20           3,107,523      99.24%         2,311,625      73.82%       25.42%
 25           3,127,031      99.87%         2,648,053      84.57%       15.30%
 30           3,130,553      99.98%         2,862,546      91.42%        8.56%
 40           3,131,209      99.999%        3,060,974      97.76%        2.24%
 47           3,131,227     100.00%         3,106,213      99.20%        0.80%
 50                                         3,115,421      99.50%        0.50%
 60                                         3,127,871      99.89%        0.11%
 70                                         3,130,530      99.98%        0.02%
 80                                         3,131,044      99.99%        0.01%
 90                                         3,131,185      99.999%       0.001%
100                                         3,131,213      99.999%       0.001%
116 (max)                                   3,131,227     100%           0%
This is a significant improvement over the English-only word list. See the green-shaded region in the following graph:

Figure 7. Improvement in Word Discernment, English-Only to Combined Language Word List
For the curious, the domain name containing the largest number of English-OR-foreign-language words was:
116, 59, escuelaprimariaurbanacrescenciocarrilloyanconaturnomatutino, com, acr, acre, acres, ana, anac, anc, anco, ancon, ancona, apr, apri, aprimar, ari, aria, arr, arri, atu, atut, aur, ban, bana, banac, car, carr, carri, carril, carrillo, cen, cenci, cencio, cio, con, cona, conatu, cre, cresce, cue, ela, enc, esc, esce, escuela, iau, iaur, ill, illo, ima, imar, imari, ino, ioc, ioca, lap, llo, loy, mar, mari, maria, mat, matu, matuti, matutino, nac, nacre, nacres, nat, natu, natur, nci, nco, nom, noma, oca, oma, omat, ona, onatu, oya, pri, prim, prima, primar, primari, primaria, rba, res, ria, riau, ril, rill, rim, rima, rimar, rimari, rno, rri, sce, scen, scu, tin, tino, tur, turn, turno, tut, tuti, tutin, uel, urb, urba, urban, urbana, urn, uti, utino, yan
The final code that you see in Appendix IV considers only effective second-level domains, and filters out:
• RRSIG and NSEC3 pseudo domains
• Domains found in the consolidated whitelist
• IDN domains (“xn--“)
• ip6.arpa and in-addr.arpa domains
• Domains of the form ip-*-*-* (e.g., ip-167-114-114)
• Names with four or more dashes
• Domains with all numeric 2nd-labels
• Domains with one or more embedded English or foreign-language words
• Domains with 2nd-labels that have Shannon entropy <= 1
• Domains with 2nd-level labels 7 characters or less in length
Looking at what’s left (sorted in descending order by Shannon entropy), we see 5,337 domains that look like:
4.133 6r5re2zfxghh3z-yf4um-w6-k9 site
4.031 5mc6xkw738bw77ggk5ff5crx8u9xsd33 biz
4.005 54re3kz9j-da5pjcf9x2ez work
3.962 r723f875e51fe4a1ec905d6bd2a42363be ws
3.94 r21037c6a95b8642e87adfb4a8c30f3302 ws
3.925 xc397b28a0e906415526a5d248760516fd ws
3.903 v316bf9146c5a89b21a40f8e43de502e1a ws
3.902 9dyfa9us3wo9exdw2di3966ypq download
3.856 w4fb1843f970915bd98e2a76d7f56e99ae to
3.856 ta0f52cf79107c18dcddb2bb113a641c7e ws
3.856 jb1f8dd9ec2e548025b132cf7a91ef11a0 ws
3.855 mt425nf-s9w3s9m-b79em2ea site
3.852 l3b01f3b6ea53dc7df1b8a57013e44c91b ws
3.844 te736da94645b8f25d92d868044157c8d2 ws
3.843 f73ac86f4014845e5edd0d81cb339cc2 top
3.83 tde595bfc08466e174bf33b445671b0a58 ws
3.827 u2a1376gf-43ty-245c com
3.822 z67c9434a799a225e6937b835da0eb06c8 ws
3.822 r8f49056003f8ceddf9d83952b56097a74 ws
[…]
2.406 88826cne com
2.406 7hx444at com
2.406 7hs1333m xyz
2.406 788js8f2 cn
2.406 73jdyyyl com
2.406 68333cne com
2.406 666024xc com
2.406 63a87770 com
2.406 617yl888 com
2.406 58333vns com
2.406 518988cp com
2.406 444360cp com
[…]
1.061 gb333333 com
1.061 cy888888 com
1.061 av888888 com
1.061 889-8888 com
1.061 666ht666 com
1.061 500w0000 com
1.061 444444kj com
We could now look for commonalities among those domains such as shared hosting IPs, common nameservers, registration via the same registrars, etc. In part two of this series, we’ll dig in on the domains we’ve found.
You’ve now seen how a naive model, crude as it may be, can still help us to find algorithmic domain names. Can we do better? We’ve hinted at some things we can try, but you’ll have to wait for the next part of this series to see exactly where else this analysis will take us.
In the meantime, why don’t you talk with a Farsight Sales Executive about access to Channel 204 at the Security Information Exchange? With access to Channel 204, you could see how your own ideas for spotting algorithmic domain names measure up!
The Farsight Security Sales Team can be reached at [email protected]
Appendix I. find-words.py (in Python 2)

#!/usr/local/bin/python2
import string
import sys
import re
import tldextract

## read in the dictionary
## we will compare words found in stdin against words in this "dictionary"
dictionary = set(open('words.txt','r').read().split())
max_len = max(map(len, dictionary))

## read in the combined list of popular domains we want to avoid
protected = set(open('combined-millions-unique.txt','r').read().split())
max_len_p = max(map(len, protected))

## ensure the TLD extractor knows to obey the Public Suffix List
extract = tldextract.TLDExtract(include_psl_private_domains=True)
extract.update()

## read in candidate domain names for processing
for line in sys.stdin:
    ## create an initial empty set
    mydomainnames = set()
    justfound = line.lower().rstrip()
    is_not_whitelisted = not(justfound in protected)
    is_not_an_idn_name = (justfound.find("xn--") == -1)
    is_not_an_ip6_arpa_name = (justfound.find("ip6.arpa") == -1)
    is_not_an_in_addr_arpa_name = (justfound.find("in-addr.arpa") == -1)
    if (is_not_whitelisted and
        is_not_an_idn_name and
        is_not_an_ip6_arpa_name and
        is_not_an_in_addr_arpa_name):
        ## use the configured extractor created above
        extracted = extract(justfound)
        justfound = extracted.domain.replace(" ","")
        if (justfound.isdigit() == False):
            justfound_length = len(justfound)
            if (justfound_length >= 8):
                efftld = extracted.suffix
                words_found = set()    # set of words found, starts empty
                for i in xrange(len(justfound)):
                    chunk = justfound[i:i+max_len+1]
                    for j in xrange(1,len(chunk)+1):
                        word = chunk[:j]    # subchunk
                        if word in dictionary:
                            if len(word) > 2: words_found.add(word)
                words_found = sorted(words_found)
                number_of_words = len(words_found)
                print str(number_of_words)+", "+str(justfound_length)+",",
                if number_of_words > 0:
                    print justfound+", "+efftld+",",
                else:
                    print justfound+", "+efftld
                if number_of_words > 0:
                    print str(words_found).replace("[","").replace("]","").replace("'","")
Appendix II. 2nd-level-dom-large (Perl)

#!/usr/bin/perl
use strict;
use warnings;
use IO::Socket::SSL::PublicSuffix;

my $pslfile = '/usr/local/share/public_suffix_list.dat';
my $ps = IO::Socket::SSL::PublicSuffix->from_file($pslfile);

while (my $line = <STDIN>) {
    chomp($line);
    my $root_domain = $ps->public_suffix($line, 1);
    printf( "%s\n", $root_domain );
}
Appendix III. R program used to create Figure 3

#!/usr/local/bin/Rscript
mydata <- read.table(file="word-length-counts.txt", header=T)
pdf("word-length-distribution.pdf", width = 10, height = 7.5)
library("ggplot2")
library("scales")
mytitle <- paste("\nDistribution of 2nd-Label Lengths\n", sep = "")
theme_update(plot.title = element_text(hjust = 0.5))
p <- ggplot(mydata, aes(length)) +
    geom_histogram(breaks=seq(-0.5, 63.5, by = 1)) +
    labs(title=mytitle, x="\nLength in Characters", y="Count\n") +
    scale_y_continuous(labels = function(x) format(x, big.mark = ",",
                                                   scientific = FALSE)) +
    theme(plot.margin=unit(c(0.5,0.5,0.5,0.5),"in"))
print(p)
Appendix IV. find-words.py with combined language word list and Shannon entropy code (in Python 3)

#!/usr/local/bin/python3
import string
import sys
import re
import tldextract
import math

def entropy(s):
    string_length = len(s)
    lc = {letter: s.count(letter) for letter in set(s)}
    temp = 0.0
    for i in lc:
        temp = temp + ((lc[i]/string_length) * math.log2(lc[i]/string_length))
    if temp == 0.0:
        return 0
    else:
        return -temp

## read in the dictionary
## we will compare words found in stdin against words in the dictionary
dictionary = set(open('combined-word-list.txt','r').read().split())
max_len = max(map(len, dictionary))

## read in the combined list of popular domains we want to avoid
protected = set(open('combined-millions-unique.txt','r').read().split())
max_len_p = max(map(len, protected))

## ensure the TLD extractor knows to obey the Public Suffix List
extract = tldextract.TLDExtract(include_psl_private_domains=True)
extract.update()

## read in candidate domain names for processing
mydomainnames = set()
for line in sys.stdin:
    ## ensure name is lowercase and is whitespace free
    justfound = line.lower().rstrip()
    ## only process stuff NOT found in the whitelist
    if not(justfound in protected):
        ## ANY of the following hit? Don't process it
        ## note that at this point the TLD is still there
        ## Start by checking for punycoded domains
        if ((justfound.find("xn--") == -1) and
            ## and .arpa domains
            (justfound.find("ip6.arpa") == -1) and
            (justfound.find("in-addr.arpa") == -1) and
            ## and anything like ip-12-34-56
            (not(re.search(r"ip-\d+-\d+-\d+", justfound))) and
            ## any names with 4 or more dashes
            (justfound.count('-') <= 3)):
            ## now pull just the 2nd-label (using the configured extractor)
            extracted = extract(justfound)
            justfound = extracted.domain.replace(" ","")
            ## only keep 2nd-labels that have at least one non-numeric
            if justfound.isdigit() == False:
                ## how long is the 2nd-label?
                justfound_length = len(justfound)
                ## what's the 2nd-label's shannon entropy?
                shan_ent = entropy(justfound)
                efftld = extracted.suffix
                words_found = set()    # set of words found, starts empty
                for i in range(len(justfound)):
                    chunk = justfound[i:i+max_len+1]
                    for j in range(1,len(chunk)+1):
                        word = chunk[:j]    # subchunk
                        if word in dictionary:
                            if len(word) > 2: words_found.add(word)
                words_found = sorted(words_found)
                number_of_words = len(words_found)
                ## these should be our highly random names
                if ((number_of_words == 0) and (shan_ent > 1) and
                        (justfound_length >= 8)):
                    rounded_shan = round(shan_ent,3)
                    print(rounded_shan, justfound, efftld)
Joe St Sauver Ph.D. is a Distinguished Scientist with Farsight Security®, Inc.
Read the next part in this series: ‘A Second Approach To Automating Detection of “Random-Looking” Domain Names: Neural Networks/Deep Learning’