
Today Farsight Security announced DNSDB 2.0 Flexible Search for DNSDB API. Flexible Search offers powerful new search capabilities that enhance DNSDB API, and which make it possible to easily do the DNSDB searches you’ve always wished you could make.
Early Adopter Access will be available on August 19th, 2020 and General Availability is scheduled for October 20th, 2020. If you’re interested in applying for Early Adopter access, please contact [email protected].
Flexible Search will be bundled at no charge for paid DNSDB API customers (and customers given access to DNSDB API under a grant from Farsight), but will NOT be included as part of DNSDB Community Edition, the free, entry-level version of our flagship solution.
Flexible Search is a “finding aid” that supplements and enhances (but does not replace) Standard DNSDB API.
Flexible Search offers users three search syntax modes in DNSDB Scout, and two otherwise.
The goal of this article is to introduce regular expressions (“regexes”) to those who find themselves wondering “what the heck ARE these ‘regex’ things that some ‘techies’ keep talking about?”
At its most basic, a regular expression (or “regex”) is just a string that describes a pattern to be matched.
For example, imagine a program scanning lines in one or more files, looking for lines that contain the regular expression pattern of interest. When it finds a line with that pattern, it prints that line out. Simple as that sounds, regexes can be extremely powerful and useful. Regexes are routinely used in the cybersecurity world by:
In a comparatively short article like this one, we can only “scratch the surface” when it comes to all the features and capabilities of regular expressions, but we hope that even this short introduction will still serve to pique your interest in regular expressions and motivate you to learn more about them. If that happens, there are a number of regular expression books you should check out, including O’Reilly’s:
To help illustrate how regular expressions work, we’ve created a small sample data file called domains.txt, which contains fifty-two assorted domain names (see Appendix I). We’re going to use that file as data for our examples.
GNU egrep
While some of you may participate in our Early Adopter Program, most of you won’t have access to DNSDB Flexible Search until October 20th, 2020. Therefore, we’ve couched the following discussion in terms of the commonly available GNU egrep command. That command treats regular expressions the same way that DNSDB 2.0 Flexible Search will, allowing eager people to get some time in learning and practicing before Flexible Search goes to General Availability status.
Once you get access to Flexible Search, using regular expressions will be as simple as plugging them into the Find field in DNSDB Scout, or using the --regex qualifier to the dnsdbflex command-line client.
The Unix “grep” command name is a “portmanteau” word. It was built from parts of the words in the phrase “globally search for a regular expression and print.”
It is a staple command-line utility on Unix systems (and on Unix-like operating systems such as Linux and Mac OS X) and should exist (in some form or another) on virtually every Unix or Unix-like system.
egrep is an enhanced version of grep. It’s what we’re going to use for the examples shown in this article.
On the latest version of Mac OS X (aka Catalina), the system-provided egrep appears to (still) be:
$ /usr/bin/egrep --version
egrep (BSD grep) 2.5.1-FreeBSD
The 2.5.1 version of egrep is known to have bugs that have been fixed in later versions. Unfortunately, because the later versions use a different open source license, Mac OS X has not updated to one of the later versions of egrep where those bugs have been corrected. The bugs are serious enough that they visibly impact the results you may get for even relatively simple queries.
Thus, we normally prefer to use, and recommend that you use, the GNU version of egrep. If GNU egrep is installed on your system (and used by default), you should see something like:
$ egrep --version
grep (GNU grep) 3.3
[etc]
If GNU egrep is not installed, you may be able to install it via your operating system’s package manager. For example, on a Mac using homebrew you can install GNU grep by saying:
$ brew install grep
You can also download and install GNU egrep from source.
Regular expressions are just strings (sometimes quite cryptic-looking strings, but still, just strings). We’ll normally put regular expressions inside single quote marks. Each regex gets built using a combination of:
Character   Name                 Special Thing This Symbol Means
\           backslash            "Escapes" the character after this one
.           dot                  Match any one character here
*           star                 Repeat the previous item zero or more times
^           caret                Match start of line
$           dollar sign          Match end of line
?           question mark        Optional (match zero or one time only)
+           plus sign            Match one or more times
|           vertical bar         Logical or (match either)
{ and }     curly braces         Repetition count: {min}, {min,max}, or {,max}
( and )     parentheses          Define logical subexpression ("create grouping")
[           left square bracket  Define character class
If you want to literally match those metacharacters, prefix them with a backslash. [Note: Some versions of egrep may attempt to guess if a metacharacter “should” be treated as a metacharacter or as the literal character. That is risky, however, so we generally urge you to explicitly indicate if you want a metacharacter to be treated as a literal.]
Shorthand character classes (such as \w, \d, \s) are used in some regular expression implementations, but will not be available in DNSDB’s Flexible Search regex implementation.
Bracketed character classes are either predefined character classes that look like the following (this is not an exhaustive list of these):
[[:alpha:]] Any upper or lower case alphabetic character
[[:digit:]] Any digit from 0 to 9
[[:alnum:]] Any alphanumeric character
[[:xdigit:]] Any hexadecimal digit (e.g., 0-9 plus A-F or a-f)
or classes that the user defines, such as
[aeiouy] Matches any vowel (or pseudo-vowel, in the case of "y")
[^aeiouy] Matches any NON-vowel (including other letters, numbers, symbols, etc.)
Note that MOST metacharacters lose their special meanings within square brackets (a notable exception is the caret symbol, as just shown in the [^aeiouy] example).
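As a rough illustration of the predefined and user-defined classes described above, the following sketch runs `grep -E` (equivalent to `egrep`) against a few names taken from the sample file. The variable names are just illustrative choices, and the names are fed in-line so the example does not depend on domains.txt being present.

```shell
# A few names from domains.txt, supplied in-line so the example is self-contained.
names='4j.lane.edu
af.mil
office365.com
pdx.edu'

# Predefined class: lines that contain at least one digit.
with_digit=$(printf '%s\n' "$names" | grep -E '[[:digit:]]')
echo "$with_digit"

# User-defined negated class: lines whose FIRST character is not a vowel.
# (The leading ^ outside the brackets anchors to the start of the line;
#  the ^ inside the brackets negates the class.)
no_leading_vowel=$(printf '%s\n' "$names" | grep -E '^[^aeiouy]')
echo "$no_leading_vowel"
```

The first command should print 4j.lane.edu and office365.com; the second should print 4j.lane.edu and pdx.edu, since a digit counts as a "non-vowel" too.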
Let’s use a regex to find lines from our sample data file that contain the literal string “off”. We’ll run the egrep command from a Terminal window on our Mac:
$ egrep 'off' domains.txt
coffee.com
office.com
office365.com
This is a pretty straightforward command: it takes a regular expression (in this case the literal string off, in single quote marks) and looks for matching lines in the specified file (domains.txt). Three “hits” are found: coffee.com, office.com and office365.com. Those get printed out when we run that command.
While in this case we just looked for a short three-character string, we could have looked for a single character, many characters, or even multiple words. (Just be sure to enclose the literal string to be matched in single quote marks if the string includes spaces!)
If we want to find lines that DON’T contain the string ‘off’, we can use the egrep -v option, which outputs only the lines that DON’T match the specified pattern:
$ egrep -v 'off' domains.txt
all lines EXCEPT coffee.com, office.com and office365.com get output here
Regular expressions are case sensitive by default (so if we’d looked for ‘OFF’ instead of ‘off’, we wouldn’t have found any matches).
If we want egrep to do case-insensitive matches, we can add the -i option to our egrep command:
$ egrep -i 'OFF' domains.txt
coffee.com
office.com
office365.com
Another handy option to egrep is the --color option. It highlights the text that matches the regular expression we supplied:
$ egrep --color 'off' domains.txt
coffee.com
office.com
office365.com
We don’t urgently need this option to understand such a simple match, but when regexes get more complex – or we make a mistake constructing our regex – highlighting the text that matched a regex can really come in handy as a debugging tool.
Let’s do another literal substring regex.
What if we want to find lines that have EITHER the literal substring ‘go’ OR the literal substring ‘off’?
GNU egrep can help us do that with the vertical bar (or “pipe”) metacharacter.
$ egrep --color 'go|off' domains.txt
coffee.com
duckduckgo.com
eugene-or.gov
google.com
house.gov
office.com
office365.com
oregonstate.edu
senate.gov
supremecourt.gov
uoregon.edu
whitehouse.gov
Note that the vertical bar (“pipe”) character is a metacharacter – it does NOT need to be physically part of the string text we’re matching.
If helpful or necessary, you can also use parentheses to set off the limits of an alternating match. For example:
$ egrep --color 'e(go|ug)' domains.txt
eugene-or.gov
oregonstate.edu
uoregon.edu
That pattern matches all records that have ego or eug in them.
Up until this part of the article, we’ve been matching literal strings. That’s cool and useful, but the real power of regular expressions comes when we begin to work with wildcards – in this case literally the dot (“.”) character. Dot is a metacharacter that matches any single character.
$ egrep --color 'g.p' domains.txt
blogspot.com
If we have two dots in a row, that matches any two characters:
$ egrep --color 'r..e' domains.txt
lclark.edu
marines.mil
supremecourt.gov
and we could also search for any three characters in a row, any four characters in a row, etc.
Note that if we want to match an ACTUAL dot (and dots are obviously VERY common in domain names), we need to ask to match an escaped (“backslashed”) dot:
$ egrep --color '\.k12\.' domains.txt
bethel.k12.or.us
cal.k12.or.us
springfield.k12.or.us
If we didn’t remember to escape those “real dots,” the unescaped dots would still match real dots, but they’d also match any OTHER single character in that spot, too.
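To make the difference concrete, here is a small sketch that compares an unescaped pattern with an escaped one. The second input line is a made-up name (it is not in domains.txt) added purely to trigger the unwanted match; `grep -E` is used as the equivalent of `egrep`.

```shell
# First line is from the sample file; the second is a hypothetical name
# invented only to show what an unescaped dot will also accept.
lines='cal.k12.or.us
calxk12yor.example'

# Unescaped dots: each "." matches ANY single character, so both lines match.
loose=$(printf '%s\n' "$lines" | grep -E '.k12.')

# Escaped dots: only a literal "." on each side of "k12" will match.
strict=$(printf '%s\n' "$lines" | grep -E '\.k12\.')

echo "unescaped:"; echo "$loose"
echo "escaped:";   echo "$strict"
```

Only the escaped form restricts the result to cal.k12.or.us.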
If you think dot was cool, wait until you learn about dot star (‘.*’) – it’s VERY cool!
If we had a regular expression that was simply ‘.*’ it would match all lines.
Therefore, most matches that contain ‘.*’ also include other specific patterns to match. For example, let’s find lines that have a b, then zero or more other characters, then a c:
$ egrep --color 'b.*c' domains.txt
bing.com
blogspot.com
crabcake.com
ebay.com
facebook.com
github.com
youtube.com
If we didn’t have the star metacharacter to give us flexibility here, we’d have to write a much “clunkier” regex with all possible patterns of zero or more dots in between the two letters of interest:
$ egrep '(bc|b.c|b..c|b...c|b....c|b.....c|b......c|b.......c|b........c)' domains.txt
[same output as the previous example; omitted here]
Yuck! And just imagine how ugly that expression would get if one of the domain names in the file happened to be a long name with a b near the start and a c twenty or thirty characters later! Truly, the “magic of dot star” is a huge convenience when it comes to writing some regular expressions.
When GNU egrep finds matches, sometimes there is more than one way the pattern could match. For example, if you asked to match '^st.*o' there are three ways it could match one line from our sample data (the matched text is shown in square brackets):
[stacko]verflow.com OR [stackoverflo]w.com OR [stackoverflow.co]m
All three of those matches start with “st” and end with “o”, right? But which one of those will GNU egrep return by default?
The answer is that GNU egrep agrees with the fictional character Gordon Gekko, played by Michael Douglas in the 1987 movie “Wall Street,” who became (in)famous for saying “Greed is good.”
By default, wildcards in grep will always try to match as much as possible while still satisfying the requested pattern. So in this example, it will match the last of the possible results: everything in stackoverflow.com up to and including the final “o”.
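One way to see the greedy behavior for yourself is the -o (--only-matching) option, which prints just the portion of the line that matched rather than the whole line. The sketch below is a minimal check using `grep -E` (equivalent to `egrep`); the variable name is illustrative.

```shell
# -o prints only the matched text, so we can see how far ".*" reached.
matched=$(printf '%s\n' 'stackoverflow.com' | grep -oE '^st.*o')
echo "$matched"
```

The output stops at the LAST “o” on the line (the one in “com”), not at the first or second one.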
We’ve seen how dot matches any SINGLE character, and dot star matches any ZERO OR MORE characters. But what if we want to match a single character from just an enumerated set of characters? For example, what if we want to match a b, then (somewhere later) any one of a, e, i, o, u, or y, and then (somewhere later still) a c?
It turns out that regular expressions can help us do this as well, using square brackets (as introduced in Section 4, above) to define a character set:
$ egrep --color 'b.*[aeiouy].*c' domains.txt
bing.com
blogspot.com
crabcake.com
ebay.com
facebook.com
youtube.com
If you’re referring to a contiguous range of characters, rather than a short, enumerated list of characters, you can take advantage of the dash character to avoid having to type a long list:
If you want to put a literal caret (^) in a list of characters, you can; just don’t put it first (if you do, it will be interpreted as meaning “take the complement of the characters that follow”).
If you want to include a literal right square bracket (]) in a list of characters, you can, but you must use it as the FIRST character in the list of characters.
If you want to put a literal dash (-) in a list of characters in square brackets, put it LAST.
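The sketch below illustrates a range and the "put the dash last" rule on a few names from the sample file. Setting LC_ALL=C is an added precaution (not something the article requires): in some locales a range such as [a-c] can include characters other than the plain ASCII letters a, b, and c. `grep -E` is used as the equivalent of `egrep`.

```shell
# A few names from domains.txt, supplied in-line.
names='apple.com
bing.com
eugene-or.gov
pdx.edu
4j.lane.edu'

# Range: lines that start with a, b, or c.
first_abc=$(printf '%s\n' "$names" | LC_ALL=C grep -E '^[a-c]')

# The dash is LAST inside the brackets, so it is a literal dash:
# this matches lines containing a digit or a dash.
dash_or_digit=$(printf '%s\n' "$names" | LC_ALL=C grep -E '[0-9-]')

echo "$first_abc"
echo "$dash_or_digit"
```

The first pattern should select apple.com and bing.com; the second should select eugene-or.gov (literal dash) and 4j.lane.edu (digit).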
You can also use “repetition factors” or “counts” to ask for multiples of patterns. For example, if you wanted to find names from our sample file that had two successive vowels, you could write:
$ egrep --color '[aeiouy]{2}' domains.txt
coffee.com
ebay.com
eou.edu
eugene-or.gov
facebook.com
freedom.com
geoduck.com
google.com
house.gov
oit.edu
paypal.com
reed.edu
sou.edu
springfield.k12.or.us
supremecourt.gov
uoregon.edu
whitehouse.gov
wikipedia.org
wou.edu
yahoo.com
youtube.com
In addition to asking for exactly a specific value, you can also specify a repetition range, such as:
For example:
$ egrep --color 'ube?\.com$' domains.txt
github.com
youtube.com
In this case, the “e” was optional, which is why youtube.com AND github.com successfully matched.
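As a further sketch of repetition counts, the example below applies an exact count and an open-ended range to a digit class, using three names from the sample file. It assumes the {min,} form ("min or more"), which POSIX extended regular expressions support alongside the forms listed in the metacharacter table; `grep -E` is used as the equivalent of `egrep`.

```shell
# A few names from domains.txt, supplied in-line.
names='bethel.k12.or.us
office365.com
4j.lane.edu'

# Exactly three digits in a row somewhere on the line.
three_digits=$(printf '%s\n' "$names" | grep -E '[[:digit:]]{3}')

# Two or more digits in a row somewhere on the line.
two_or_more=$(printf '%s\n' "$names" | grep -E '[[:digit:]]{2,}')

echo "$three_digits"
echo "$two_or_more"
```

Only office365.com has three consecutive digits, while both bethel.k12.or.us and office365.com have at least two; 4j.lane.edu has a single digit and matches neither pattern.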
The patterns that we’ve been matching have all been “floating” patterns. Those patterns can potentially match suitable text seen anywhere in the lines they’re scanning. But what if we only want to match a particular pattern at the start of a line, or at the end of a line? Those types of searches are called “anchored searches,” and we can use special metacharacters to limit our results:
^ (the caret symbol) "At the start of the line"
$ (the dollar sign) "At the end of the line"
For example, let’s find the domains in the file that are ‘.edu’ domains:
$ egrep --color '\.edu$' domains.txt
4j.lane.edu
eou.edu
lanecc.edu
lclark.edu
oit.edu
oregonstate.edu
pdx.edu
reed.edu
sou.edu
uoregon.edu
willamette.edu
wou.edu
Important note: In DNSDB 2.0 Flexible Search, domain names in RRnames (and some Rdata) are written with a “formal ending dot.” Literal dots are also escaped with a backslash. That means that the domain name wou.edu would be written in regular expression format as:
wou\.edu\.$
If that’s the case for the data you’re matching against, the anchored search would need to be written:
$ egrep --color '\.edu\.$' domains.txt
rather than just
$ egrep --color '\.edu$' domains.txt
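The following sketch shows why the trailing dot matters for an end-of-line anchor. The input is hypothetical: the same name written once with a formal ending dot and once without (domains.txt itself has no trailing dots). `grep -E` is used as the equivalent of `egrep`.

```shell
# Hypothetical input: one name with the formal ending dot, one without.
names='wou.edu.
wou.edu'

# Anchored pattern that expects the trailing dot.
with_dot=$(printf '%s\n' "$names" | grep -E '\.edu\.$')

# Anchored pattern that expects the line to end right after "edu".
without_dot=$(printf '%s\n' "$names" | grep -E '\.edu$')

echo "$with_dot"
echo "$without_dot"
```

Each pattern matches only the form it was written for, so an anchored search that ignores the trailing dot would silently return nothing against dot-terminated names.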
Or as another example, let’s find the domains that begin with an s:
$ egrep --color '^s' domains.txt
senate.gov
sou.edu
springfield.k12.or.us
stackoverflow.com
supremecourt.gov
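Both anchors can also be combined in a single pattern. The sketch below, which is an illustration rather than an example from the original article, finds names that begin with an s AND end in .gov, using a few names from the sample file and `grep -E` as the equivalent of `egrep`.

```shell
# A few names from domains.txt, supplied in-line.
names='senate.gov
sou.edu
supremecourt.gov
whitehouse.gov'

# ^s      : the line must start with "s"
# .*      : then zero or more of any character
# \.gov$  : and the line must end with a literal ".gov"
s_gov=$(printf '%s\n' "$names" | grep -E '^s.*\.gov$')
echo "$s_gov"
```

With both anchors present the pattern has to account for the entire line, which is why whitehouse.gov (wrong start) and sou.edu (wrong ending) are both excluded.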
You may also want to know that there are some “specialty” versions of grep, such as:
You’ve now had a bit of a whirlwind introduction to regular expressions. If you want to learn more, check out the books mentioned in the introduction, or consider trying one of the online interactive regular expression tutorials.
Regular expressions may feel a bit like they’re “brain teasers” or puzzles from The New York Times puzzle page, but if you tackle them with the right attitude, you may find they’re exceptionally powerful and sort of fun, too!
The author would like to acknowledge valuable reviews and commentary from colleagues, including (in alphabetical order) Chris Mikkelson, Jeremy Reed, Chuq Von Rospach, Stephen Watt, and Eric Ziegast. Any remaining issues or errors are solely the responsibility of the author.
$ cat domains.txt
4j.lane.edu
af.mil
amazon.com
apple.com
army.mil
bethel.k12.or.us
bing.com
blogspot.com
cal.k12.or.us
coffee.com
crabcake.com
duckduckgo.com
ebay.com
eou.edu
eugene-or.gov
facebook.com
freedom.com
geoduck.com
github.com
google.com
house.gov
instagram.com
lanecc.edu
lclark.edu
linkedin.com
live.com
marines.mil
microsoft.com
msn.com
navy.mil
netflix.com
office.com
office365.com
oit.edu
oregonstate.edu
paypal.com
pdx.edu
reddit.com
reed.edu
senate.gov
sou.edu
springfield.k12.or.us
stackoverflow.com
supremecourt.gov
twitter.com
uoregon.edu
whitehouse.gov
wikipedia.org
willamette.edu
wou.edu
yahoo.com
youtube.com
More on validating input fields: An example of input validation might be a rule that says “the employee salary field can only contain numbers, commas, a decimal point and/or a dollar sign.” “$83,412.15” would pass that validation definition but “$7K/month” would not. More carefully-defined validation rules might be used to screen out typos/data entry errors such as “8$3,412.15” or “$83,,412.15” or “$83,412.155”. Validation rules might also be used to identify likely-out-of-range values such as “$8341215.”
Joe St Sauver Ph.D. is a Distinguished Scientist and Director of Research with Farsight Security®, Inc.
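As a sketch of what a "more carefully-defined" salary rule could look like, the pattern below is one possible choice, not a rule taken from any particular system. It assumes a leading dollar sign, groups of three digits separated by commas, and exactly two decimal places. `grep -E` is used as the equivalent of `egrep`.

```shell
# Candidate values taken from the note above, one per line.
amounts='$83,412.15
$7K/month
8$3,412.15
$83,,412.15
$83,412.155'

# ^\$                : a literal dollar sign at the start of the line
# [0-9]{1,3}         : one to three leading digits
# (,[0-9]{3})*       : zero or more ",ddd" groups
# \.[0-9]{2}$        : a literal dot and exactly two digits at the end
valid=$(printf '%s\n' "$amounts" | grep -E '^\$[0-9]{1,3}(,[0-9]{3})*\.[0-9]{2}$')
echo "$valid"
```

Of the five candidates, only $83,412.15 satisfies this stricter rule; the anchors at both ends are what reject values with stray characters before the dollar sign or extra digits after the cents.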