That which we call a rose by any other name would smell as sweet. (Act 2, Scene 2)
How do you split a domain name into words? That seems like a simple question, but the answer is actually complicated and abounds with caveats and limitations. Computers don't (by default, anyway) have the language context that humans do, so they aren't able to look at a mass of letters and instinctively pull out the words that make up that jumble. When an English speaker sees the domain name
mylongjohnsilversexperience.com, they will be able to identify it as "My Long John Silvers Experience" fairly quickly. A computer, on the other hand, will rely on assumptions, algorithms, and statistics to split that domain name into words, and even then there's no guarantee it will arrive at the same split as a human.
Why would you want to do this in the first place, though? There are a few obvious reasons, and some not-so-obvious ones:
- Brand monitoring. Companies want to see if their brand is being impersonated in domain names online. This is comparatively easy for Microsoft, because their name is unique, but much harder for companies whose brand names are also common words or strings.
- Trending words. In our case, we find it useful to see if certain words are trending in popularity across the entire internet. Looking at trending words (we call them Domain Blooms) allows us to see defensive registrations, fraud campaigns, etc.
- Clustering. Once you have the words that make up a set of domains, you can start to build on them with machine learning methods like clustering to find natural groupings of domains. A number of people have talked about this, including us: DomainTools released a webinar in October 2022 about extracting words from domain names and clustering the domains by those words.
So we have “what” (splitting strings into words), and we have “why” (brand monitoring and trending words). Let’s talk about “how.”
Searching for Specific Terms Using Pattern Matching
O teach me how I should forget to think. (Act 1, Scene 1)
If you are looking for specific terms, e.g., for brand monitoring, the most obvious way to do it is with a simple pattern matching algorithm. This can be as simple as checking whether a specific string is contained in the string you’re examining, or it can be more complex using regular expressions. In either case, you are creating a pattern to check the string against—for example, you might make a regular expression like
airplanes?, which would match both airplane and airplanes. A domain name like
ilikeairplanes.com would match, as would airplanet.com.
Pattern matching works well if you know the exact string you are looking for and are not terribly concerned with what the other words or phrases in the string are. It is also very easy to reason about: there is no complex processing involved unless you make your regular expression super-complicated (and if you find yourself building a super-complicated regular expression, it may be time to move to another algorithm anyway).
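As a minimal sketch, pattern matching in Python with the standard-library re module might look like this (the domain names and pattern here are purely illustrative):

```python
import re

# Match "airplane" with an optional trailing "s".
pattern = re.compile(r"airplanes?")

domains = ["ilikeairplanes.com", "airplanet.com", "example.com"]

# Keep any domain whose name contains the pattern anywhere.
matches = [d for d in domains if pattern.search(d)]
print(matches)  # ['ilikeairplanes.com', 'airplanet.com']
```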
There is a problem with pattern matching, however. Static pattern matching can find words in strings, but the “word” it finds may not be a word: The word may actually cross word boundaries in a proper splitting of the string or be entirely contained inside a different word. The
airplanet domain above is an example of this: The overall string does contain
airplane, but that’s not how a human would read that domain name (more on this below—the
mylongjohnsilversexperience example wasn’t chosen by accident).
Breaking Apart a String Into N-Grams
Parting is such sweet sorrow. (Act 2, Scene 2)
If you need to get more complicated than pattern matching, your next option could be n-grams.
What’s an n-gram? N-grams are a fancy way of saying “split this string into pairs/triples/quadruples/quintuples/etc. in order.” So a 2-gram (also called a bigram) split of a string like
mylongjohnsilversexperience would return 2-letter chunks like my, yl, lo, on, ng, and so on. A 3-gram (or trigram) split does the same thing but into 3-letter chunks like myl, ylo, lon, and so on.
This technique has the advantage of being really easy to implement and reason about. There’s no magic about how the word splits happen; you just take every possible n-letter split of the string, and consider them all. The only judgment call to make is what number to choose for n.
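Since an n-gram split is just every contiguous n-letter chunk of the string, a sketch in plain Python is only a couple of lines (the helper name is ours):

```python
def char_ngrams(s, n):
    # Every contiguous n-letter chunk of s, in order.
    return [s[i:i + n] for i in range(len(s) - n + 1)]

print(char_ngrams("mylong", 2))  # ['my', 'yl', 'lo', 'on', 'ng']
```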
However, it's also a brute-force approach to the problem, and a somewhat naive one. A high percentage of the word splits it creates aren't actually words, which can create extra work if you are analyzing or aggregating on those words later on. For example, there are 5 English words in mylongjohnsilversexperience, but a 5-gram split of that string will return 23 possible words (mylon, ylong, longj, etc). That's 4-5 times more processing needed for a 5-gram split of the string than the true word split. Also, a 5-gram split misses all the actual English words: none of the words in that domain name are 5 characters long. That's going to be a problem no matter what n you choose, since it's rare that a multi-word domain name is made up of words of all the same length. If you do this to all domains on the internet, you multiply those problems by all 300 million active registrable domain names, and you are left with a lot of extra processing and a lot of missed words.
Also, n-gram splits risk finding words that aren’t actually in the correct split of the string. A 5-gram split will find the word
verse (which is a dictionary word) in the string
silversexperience. That’s correct in that the letters do appear in that order in the string, but the word
verse does not appear in the correct split of that domain name into English words. We noticed this problem when we saw a large number of registrations for impersonation domains for the Long John Silvers Experience site at approximately the same time that Facebook was rebranding to Meta. That meant that the word verse was already trending in domain registrations (as a part of
metaverse), but a 5-gram split of domains was finding
verse in all of those Long John Silvers impersonation sites as well, which incorrectly inflated the size of the verse Bloom.
Assuming you're okay with those limitations (or just want a quick-and-dirty starting point), the next obvious question is "how do I implement this?" The easiest way in Python is to use the nltk library, which includes an ngrams function that you would call like this:

```python
from nltk import ngrams

domain_name = "mylongjohnsilversexperience.com"

# Do 5-grams initially
grams = ngrams(domain_name, 5)
for entry in grams:
    print("".join(entry))
```
Splitting Strings Using Zipf’s Law
Wisely and slow; they stumble that run fast. (Act 2, Scene 3)
So, having tried n-grams and decided they weren't good enough, what do you do instead? To answer that, we have to take a short diversion into the statistics of language and talk about a thing called Zipf's Law.
Zipf’s Law is an observation that many distributions in nature follow a power-law — that the largest entry in the distribution will be approximately twice as big as the second entry, which will be twice as big as the third, and so on. For example, the biggest city in the U.S. (New York) has approximately 8 million people, while Los Angeles (the second biggest) has approximately 4 million (I’m rounding). The third largest, Chicago, has 2.5 million (which isn’t exactly half…this is approximate, go with me here). This sort of decaying power-law distribution is very common in nature, and Zipf himself noticed that it applied to the frequency of word usage in English (and, it turns out, most languages). This means that the most frequently-used word in English,
the, is used approximately twice as often as the next most popular word, of.
Why should you care about this? Well, knowing that the English language behaves like this, you can use it to work backwards to find the most likely set of words that make up an arbitrary string. In effect, to use Zipf’s Law to split a string, you start with a dictionary of all English words that includes their respective popularity (so
the is 1,
of is 2, etc). You then try to split up the string in question into those dictionary words (sometimes having letters left over). There will be multiple ways to split a string into words in the dictionary, so to find the “best” one, you give each entry in the collection of possible splits a score: Each possible split is scored based on the popularity of the words it used, so using less popular words in a split gives it a higher score. You also add a penalty score for leftover letters after the split process. After doing all that, the split with the lowest total score is your most likely split of the word.
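To make the scoring concrete, here is a minimal dynamic-programming sketch of a Zipf-style splitter. The toy word list, cost formula, and leftover-letter penalty are all illustrative assumptions; a real implementation would use a large frequency-ranked dictionary:

```python
import math

# Toy frequency-ranked word list, most common first (a real implementation
# would use a large list derived from a corpus such as Wikipedia).
WORDS = ["the", "of", "my", "long", "john", "experience", "silvers", "lawyer", "law"]

# Zipf-style cost: rarer words (higher rank) get a higher cost.
COST = {w: math.log((rank + 1) * math.log(len(WORDS) + 1)) for rank, w in enumerate(WORDS)}
MAX_WORD_LEN = max(len(w) for w in WORDS)
LEFTOVER_PENALTY = 10.0  # arbitrary penalty for a letter no dictionary word covers

def zipf_split(s):
    # best[i] holds (total cost, split) for the first i characters of s.
    best = [(0.0, [])]
    for i in range(1, len(s) + 1):
        candidates = []
        for j in range(max(0, i - MAX_WORD_LEN), i):
            chunk = s[j:i]
            if chunk in COST:
                prev_cost, prev_split = best[j]
                candidates.append((prev_cost + COST[chunk], prev_split + [chunk]))
        # Fall back to treating the current letter as a penalized leftover.
        prev_cost, prev_split = best[i - 1]
        candidates.append((prev_cost + LEFTOVER_PENALTY, prev_split + [s[i - 1]]))
        best.append(min(candidates, key=lambda c: c[0]))
    return best[-1][1]

print(zipf_split("mylongjohnsilversexperience"))
# ['my', 'long', 'john', 'silvers', 'experience']
```

The lowest-cost path naturally prefers a few common words over many rare words or leftover letters, which is exactly the Zipf intuition described above.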
This works surprisingly well, assuming you have a good dictionary of word frequency. For example, it will correctly split
mylongjohnsilversexperience into [my, long, john, silvers, experience], which is exactly what we'd like it to do. It also gives you only one result for a split, which cuts down on the processing time needed compared to the n-grams approach.
There are some limitations to using Zipf’s Law, though.
Most obviously, this approach is heavily dependent on the dictionary you use. If that dictionary is missing a word, that word simply cannot be used for splitting a string. This becomes important for brand identification: some brand names aren't real English words, or are deliberate misspellings of English words. If you are trying to use this technique to identify domain names attempting to impersonate these brands, you need to ensure that your dictionary contains those brand names. This also shows up with slang or word shortenings. For example, take the word intim, a short version of the word intimate. Intim is not an English word and will not show up in any common word usage lists, so a Zipf-based split of intimrestaurants will not get that split correct. It will, instead, split that portion into in and tim, which will cause problems if you are aggregating on words to look for trends.
Also, the relative ordering of words in the dictionary will have a big impact on the final splits that come out of this approach. For example, if you try to split the domain name
lawyerx.com into words, there are two possibilities: [law, yerx] or [lawyer, x]. The second is what we would instinctively prefer, but if your penalty isn't dependent on the length of the dictionary miss (which it often isn't), and
law is listed as more popular than
lawyer in your dictionary, the Zipf split of that domain will not be what you expect.
This word order issue can come about as a form of bias in the dictionary. Wikipedia is an obvious data source for dictionaries like this (it’s free and it’s big), but Wikipedia is written mostly in the third person (he, she, they), so first- and second-person pronouns do not appear in Wikipedia as often as they would in spoken English or prose. Gender biases also show up in Wikipedia’s word ordering: his appears in the top 40 words in Wikipedia, but
hers is in the 18,000 range, preceded by a long list of far more obscure words.
Given that dictionaries are so important to this approach, the next obvious question becomes: where do you get one? If you're prepared to accept the limitations of Wikipedia, a Wikipedia-derived word frequency list is available on GitHub. Wikipedia itself also maintains a list of word lists for various languages. You can, of course, also try to build your own, or combine these in creative ways.
Now that you have a word list, are there Python libraries that implement this? There is one: wordninja. Wordninja, however, is mostly a re-implementation of this StackOverflow answer.
Stemming and Lemmatization
O, swear not by the moon, the inconstant moon. (Act 2, Scene 2)
Splitting things into words with a dictionary is a good start, but treating words as pure strings can lead to problems. For example, consider plurals: cake and cakes are clearly referring to the same object, just multiple of them in the second case. So, when reading a domain name, most people would mentally treat vanillacake.com and vanillacakes.com as similar at the very least, if not assume they are the same site. However, to a computer, cake and cakes are different strings, and so would be treated differently by string splitting algorithms like the Zipf splitting we just talked about.
If you want to handle that, say to treat those two domain names as part of the same group, what do you do? There are two main approaches: stemming and lemmatization.
In both cases, the point of the process is to replace words in your sentence/document/string/whatever with words that represent that word but are more general. For example, replacing running with run. How they do that, and which words are appropriate for consideration, differ, however.
In stemming, the process is fairly simple: You chop letters off the end of the word until you reach a point where the word is its general form (or close enough to be a unique representation of that word). So
running would chop
ning off the end to get to
run. This works well as a first approximation, especially when handling the majority of plurals. However, the English language is complex, and many words do not work well when treated this way. To address this, some stemming libraries include multiple passes of analyzing a word, with slightly different rules and replacements at each pass. Still, though, words like
saw can be a challenge, as it can be a noun (and therefore left alone) or a verb (and therefore replaced with
see) depending on context.
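The suffix-chopping idea above can be sketched in a few lines. This is a deliberately naive toy, not a real stemmer; production stemmers such as nltk's PorterStemmer apply multiple, more careful rule passes:

```python
def toy_stem(word):
    # Chop a common suffix off the end (toy rule set, not a real stemmer).
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Undouble a trailing consonant: "runn" -> "run".
    if len(word) > 3 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

print(toy_stem("running"))  # run
print(toy_stem("cakes"))    # cake
```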
Lemmatization, on the other hand, looks at the word in the context of how it's used in language and in the sentence being analyzed. Because of that, it can handle words whose spelling changes in different cases, words whose handling changes by context (saw, as mentioned above), and more complex cases. Lemmatization may replace words not just by case but by their role in a sentence. For example, if you lemmatize the sentence
I'm going to go running, the result would look like:
| Word | Part of Speech | Lemma |
|------|----------------|-------|
| I | PRON | I |
| 'm | AUX | be |
| going | VERB | go |
| to | PART | to |
| go | VERB | go |
| running | VERB | run |
This table shows the original word on the left, the part of speech as identified by the lemmatizer in the middle, and the lemma it's replaced with on the right. Note that lemmatization replaces the 'm with be, which is interesting, as it's pulling apart a contraction.
Obviously, this is an enormous help getting past tense problems, conjugation problems, and word ambiguity issues, and it helps immensely with identifying the general version of the words in a string.
However, there are some limitations with this approach. The first is that it is even more dependent on a dictionary, or in this case a “model,” than the Zipf split algorithm was. So, as with the Zipf split above, if
intim isn’t in the language model, it’s not going to be able to do much with it as a word. Lemmatization may still properly identify it as an adjective, based on where it shows up in a sentence (assuming a domain name is a mini-sentence), but that’s as far as it’s able to go.
The second problem with using lemmatization on domain names is that most punctuation marks aren't available in domain names, and periods don't mean the same thing in domain names that they do in English. This makes context inference much harder for the lemmatization frameworks. It has the biggest impact on contractions: humans will identify in context that some words are really contractions and will mentally insert the apostrophe where needed. Lemmatization algorithms aren't built to add it back, so those words won't get properly tokenized or lemmatized. For example:
letsgorunning.com — humans will properly split that up to
let's go running, but the common tokenizing and lemmatizing frameworks won’t have the domain name context to know that it should split
lets into [let, 's] and so will miss the pronoun us in that sentence.
With all that in mind, how can you actually do this in Python? As lemmatizers are much more comprehensive, I'll stick to them. (Stemming is also available in nltk, but it's not used nearly as often as lemmatization.) For lemmatization, there are a number of libraries that implement it:

- nltk
- spaCy
- HuggingFace
Each of these is very different. The nltk implementation is very basic and will require you to manually specify the part of speech for a word to lemmatize it properly. The spaCy and HuggingFace lemmatizers are much more comprehensive, each with its own model for how language works. The difference between the HuggingFace and spaCy models is primarily one of application: HuggingFace is designed to work on GPUs with Deep Learning, while spaCy is targeted at traditional NLP analysis running on CPUs.
What Can You Do With Domain Names Split Into Words?
What light through yonder window breaks? (Act 2, Scene 2)
So, now that we can split domain names into words, what can we do with that?
First, DomainTools is already using this in our Domain Blooms work as mentioned earlier: We split a domain name into words and then find trending words in the new registrations of domain names. When a word trends far above its baseline, we call that a Bloom and consider the word something potentially interesting to investigate. When we first started working on Blooms, we initially just split domains by whether any dictionary word existed in the domain name (similar to the pattern matching mentioned at the start). Since then we've moved to a Zipf's Law-style splitting of words, which helped make Bloom identification more efficient and effective.
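A heavily simplified sketch of that kind of trending-word detection follows. The function name, threshold, and counts are hypothetical illustrations, not DomainTools' actual Bloom implementation:

```python
from collections import Counter

def find_trending(todays_words, baseline_counts, ratio=3.0, min_count=5):
    """Return words whose count today is far above their historical baseline."""
    today = Counter(todays_words)
    trending = []
    for word, count in today.items():
        baseline = baseline_counts.get(word, 1)  # unseen words get a tiny baseline
        if count >= min_count and count / baseline >= ratio:
            trending.append(word)
    return trending

# "verse" jumps from a baseline of 2 to 10 registrations; "shop" stays flat.
baseline = {"shop": 100, "verse": 2}
todays = ["verse"] * 10 + ["shop"] * 90
print(find_trending(todays, baseline))  # ['verse']
```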
Second, we've used it to identify what we called Dangerous Words inside a Bloom. In this case, we looked at the words in the set of domains registered with the word "ukraine" in them and found the words that were most strongly correlated with the domain being considered malicious.
Lastly, you can also use these techniques, especially the post-lemmatization results, as input to a clustering algorithm to see if you can find clusters of domains based around words. This approach was used in the webinar mentioned earlier, where they looked at a list of domains and used lemmatized words from the domains to cluster them together by topic. For example, in that webinar, they found clusters of domains that contained Apple-related strings such as icloud, correctly concluding that all of these were targeting Apple products and services.
But, this is just a starting point. Once you can pull domain names apart into words, you open the door (or window in this case) to a whole world of Statistics, Machine Learning, and Natural Language Processing techniques, from the clustering mentioned above all the way to classification and sentiment analysis.
We’ll discuss some of those clustering techniques (along with some ways to resolve homoglyph attacks) in an upcoming live discussion on February 22 at 10AM PST / 1PM EST. Sign up and save a spot below: