
Enhancing dnsdbq Output With Geolocation Data

Introduction

Farsight DNSDB allows users to query domain names and get back IP addresses. These DNS “A record” results link a domain name to one or more IP addresses for a time window. Often when doing an investigation, you may want additional contextual information about an IP address, such as routing information via the IP’s Autonomous System Number (ASN) or even geolocation data related to that IP address. We’ll discuss how to do both in this post.

dnsdbq, the command line client for the Farsight DNSDB API, can already enhance results with Autonomous System Numbers: users just need to include the -a (dash lowercase “a”) option. This works with both presentation format output:

$ dnsdbq -r phloem.uoregon.edu/A/edu -A90d -a
;; record times: 2010-06-24 03:08:15 .. 2024-01-30 14:06:13 (~13y ~223d)
;; count: 137121546; bailiwick: edu.
phloem.uoregon.edu.  A  128.223.32.35  ; AS3582 128.223.0.0/16

and with JSON Lines output mode, pretty-printed with jq:

$ dnsdbq -r phloem.uoregon.edu/A/edu -A90d -a -j | jq '.'
{
  "count": 137121546,
  "time_first": 1277348895,
  "time_last": 1706623573,
  "rrname": "phloem.uoregon.edu.",
  "rrtype": "A",
  "bailiwick": "edu.",
  "rdata": [
    "128.223.32.35"
  ],
  "_dnsdbq": {
    "anno": {
      "128.223.32.35": {
        "asinfo": {
          "as": [
            3582
          ],
          "cidr": "128.223.0.0/16"
        }
      }
    }
  }
}

If you run into an ASN you don’t recognize when enhancing with ASN information, Hurricane Electric’s BGP site makes it easy to search for the ASN you’re interested in.
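If you want to work with the annotated JSON Lines output programmatically rather than through jq, the structure shown above is easy to walk. The following is a minimal sketch, not part of dnsdbq itself: the function names summarize and to_utc are made up for illustration, and treating the epoch timestamps as UTC is an assumption (it does reproduce the record times shown in the presentation format output above).

```python
import json
from datetime import datetime, timezone

# One line of "dnsdbq ... -a -j" output, copied from the example above.
SAMPLE_LINE = (
    '{"count": 137121546, "time_first": 1277348895, "time_last": 1706623573, '
    '"rrname": "phloem.uoregon.edu.", "rrtype": "A", "bailiwick": "edu.", '
    '"rdata": ["128.223.32.35"], "_dnsdbq": {"anno": {"128.223.32.35": '
    '{"asinfo": {"as": [3582], "cidr": "128.223.0.0/16"}}}}}'
)


def to_utc(epoch):
    """Format epoch seconds as a UTC timestamp string (hypothetical helper)."""
    stamp = datetime.fromtimestamp(epoch, tz=timezone.utc)
    return stamp.strftime("%Y-%m-%d %H:%M:%S")


def summarize(line):
    """Return one flat dict per rdata IP in a dnsdbq JSON Lines record."""
    rec = json.loads(line)
    # The annotations are keyed by IP address under _dnsdbq.anno
    anno = rec.get("_dnsdbq", {}).get("anno", {})
    rows = []
    for ip in rec.get("rdata", []):
        asinfo = anno.get(ip, {}).get("asinfo", {})
        rows.append({
            "rrname": rec["rrname"],
            "ip": ip,
            "asns": asinfo.get("as", []),   # empty list if not annotated
            "cidr": asinfo.get("cidr"),     # None if not annotated
            "first_seen": to_utc(rec["time_first"]),
            "last_seen": to_utc(rec["time_last"]),
        })
    return rows


rows = summarize(SAMPLE_LINE)
for row in rows:
    print(row)
```

Records that were fetched without the -a option simply come back with an empty asns list and a cidr of None.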

Routing information is only one example of how you could potentially enhance dnsdbq output. What if you wanted to enhance DNSDB results with geolocation data? At the time this article was written, dnsdbq didn’t have the ability to add geolocation data natively, so we’ll show you how you can add it yourself “after the fact.”

Geolocation Data in dnsdbq

Let’s begin with a simple question: exactly what geolocation data do we want to add? For example, should our geolocation-enhanced data just consist of decimal degrees latitude and longitude? Or should it also include a city name, state name, and country code? We’ll assume that we want all of the above.

Given the potential volume of queries associated with enriching millions of results, rather than querying a geolocation API service interactively over the Internet, we’ll assume that we need to download and install a copy of the geolocation data on our local machine. For this Python3 proof of concept, we’ll assume we’re using DB-IP’s ip-to-city-lite database.

The January 2024 “Lite” version used for this proof of concept has “just” 5,455,129 records (vs. 27,240,259 records for the paid commercial version from the same provider), but the “Lite” version should still work fine for a proof of concept. The license for that database states:

“You are free to use this IP to City Lite database in your application, provided you give attribution to DB-IP.com for the data. In the case of a web application, you must include a link back to DB-IP.com on pages that display or use results from the database. You may do it by pasting the HTML code snippet below into your code: <a href='https://db-ip.com'>IP Geolocation by DB-IP</a>”

Thank you for making that database available, DB-IP!

We’ll access the DB-IP IP-to-City-Lite binary database from Python3 using GeoIP2-python (that module is Apache 2 licensed). The library is easy to install with:

$ pip3 install geoip2
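To give a feel for what a lookup returns, here is a small sketch of pulling out just the fields we care about. The attribute names (location.latitude, city.name, subdivisions.most_specific.name, country.iso_code) are the ones the script in Appendix II reads from the geoip2 response. The function name city_fields and the FakeReader class are hypothetical: FakeReader is a stand-in with hard-coded values so the sketch runs without the database file installed.

```python
from types import SimpleNamespace


def city_fields(reader, ip):
    """Look up ip with a geoip2-style reader and return the fields we need."""
    response = reader.city(ip)
    return {
        "ip": ip,
        "lat": response.location.latitude,
        "long": response.location.longitude,
        "city": response.city.name,
        "state": response.subdivisions.most_specific.name,
        "country": response.country.iso_code,
    }


class FakeReader:
    """Hypothetical stand-in for geoip2.database.Reader (fixed answer)."""

    def city(self, ip):
        # Values hard-coded for illustration only; a real reader consults
        # the MMDB file for the address it is given.
        return SimpleNamespace(
            location=SimpleNamespace(latitude=47.6222, longitude=-122.337),
            city=SimpleNamespace(name="Seattle (South Lake Union)"),
            subdivisions=SimpleNamespace(
                most_specific=SimpleNamespace(name="Washington")),
            country=SimpleNamespace(iso_code="US"),
        )


record = city_fields(FakeReader(), "15.197.181.170")
print(record)
```

With the real library installed, you would pass geoip2.database.Reader("/path/to/your.mmdb") in place of FakeReader(), where the path is wherever you saved the DB-IP file.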

Distance From Some User-Specified Location

In addition to absolute geolocation, we’re also going to compute the distance to each geo-located IP address from a user-specified reference location. That is, we’ll report “How close am I to each geolocated site?” We’ll define our reference point as a latitude, longitude coordinate pair in the file ~/.location. For example:

44.0521, -123.0868	<-- That's the location of City Hall in Eugene, OR

We’ll use the Haversine formula to compute the distance between the points we’ll be geo-enhancing and our chosen reference location. We’ll include the reference location and computed distance in our geolocation-enhanced output as additional metadata.
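For reference, the Haversine computation can be written out as follows. Here φ1 and φ2 are the two latitudes in radians, Δφ and Δλ are the differences in latitude and longitude in radians, and R is the radius of the Earth. The script in Appendix II uses R = 6,371,000 meters and then multiplies the result by 0.0006213712 to convert meters to miles.

```latex
\begin{aligned}
a &= \sin^2\!\left(\frac{\Delta\varphi}{2}\right)
     + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\Delta\lambda}{2}\right) \\
c &= 2\,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1-a}\right) \\
d &= R\,c
\end{aligned}
```

This treats the Earth as a perfect sphere, so the distances are approximations, which is in keeping with the coarse precision of the geolocation data itself.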

Not All Addresses Can Be Geolocated

Globally routable addresses can normally be geolocated. Some addresses, however, are inherently NOT geo-locateable. Those categories of addresses include the special purpose addresses listed at:

We’ll screen those out of the geo-enhancement we perform. We’d normally do that screening with ipaddress.ip_address(myip).is_private and similar calls, but we ultimately found it easier to build our own filter of addresses to exclude using pytricia (https://github.com/jsommers/pytricia).
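The screening idea can also be sketched with nothing but the Python standard library. To be clear, this is not the pytricia-based filter the script in Appendix II uses: it is a simpler linear scan over an explicit (and deliberately abbreviated) list of networks, and the names SPECIAL_NETS and is_special are made up for this illustration. A linear scan is fine for a handful of prefixes; a Patricia tree becomes more attractive as the list grows.

```python
from ipaddress import ip_address, ip_network

# Abbreviated list of special-purpose prefixes, for illustration only.
# The full list used by enhance_json_2 appears in Appendix II.
SPECIAL_NETS = [ip_network(prefix) for prefix in (
    "10.0.0.0/8", "100.64.0.0/10", "127.0.0.0/8", "169.254.0.0/16",
    "172.16.0.0/12", "192.168.0.0/16", "224.0.0.0/4", "240.0.0.0/4",
    "::1/128", "2001:db8::/32", "fc00::/7", "fe80::/10",
)]


def is_special(ip_text):
    """Return True if ip_text falls inside any listed special-purpose prefix."""
    ip = ip_address(ip_text)
    # Only compare addresses against networks of the same IP version
    return any(ip.version == net.version and ip in net
               for net in SPECIAL_NETS)


for candidate in ("10.1.2.3", "100.64.0.1", "3.33.167.235", "fe80::1"):
    print(candidate, is_special(candidate))
```

An explicit list like this also makes it obvious exactly which ranges are being excluded, including ranges such as 100.64.0.0/10.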

Sample Raw dnsdbq JSON Lines Output, and Sample Enhanced dnsdbq JSON Lines Output

Let’s see what geo-located data from dnsdbq might look like. We’ll begin with normal output from dnsdbq:

$ dnsdbq -r ucla.edu/A -A90d -Tdatefix -j > ucla.edu.jsonl

That command searches for “A” records (IPv4 address records) for the exact domain name “ucla.edu,” restricting the returned results to those seen sometime in the last 90 days, with dates shown in human-readable format and with JSON Lines output going to the specified output file. There’s only one result for that query. “Pretty-printing” that result with jq, we see the following:

$ jq '.' ucla.edu.jsonl
{
  "count": 484650,
  "time_first": "2022-12-16 12:52:32",
  "time_last": "2023-12-29 21:15:01",
  "rrname": "ucla.edu.",
  "rrtype": "A",
  "bailiwick": "ucla.edu.",
  "rdata": [
    "3.33.167.235",
    "15.197.181.170"
  ]
}

You’ll notice that even though we have just a single result, that result has two IP addresses in its rdata array. When we geo-enhance that output by running:

$ enhance_json_2 ucla.edu.jsonl > ucla.edu.enhanced.jsonl

we’ll end up with a file that looks like the following (this would normally be a single long line, but we’ve wrapped it for display here):

$ cat ucla.edu.enhanced.jsonl
{"count": 519423, "time_first": "2022-12-16 12:52:32", "time_last": 
"2024-01-22 17:08:48", "rrname": "ucla.edu.", "rrtype": "A", 
"bailiwick": "ucla.edu.", "rdata": ["3.33.167.235", "15.197.181.170"], 
"enhanced_rdata": [{"ip": "3.33.167.235", "lat": 45.5017, "long": -73.5673, 
"city": "Montreal", "state": "Quebec", "country": "CA", 
"start_loc": "44.0521, -123.0868", "distance": 2391.82541},
{"ip": "15.197.181.170", "lat": 47.6222, "long": -122.337, 
"city": "Seattle (South Lake Union)", "state": "Washington", 
"country": "US", "start_loc": "44.0521, -123.0868", "distance": 249.29318}]}

In raw (unformatted for publication) format, this sample output passes JSON Lines validation when tested with multiple validators:

The “magic” behind getting geo-enhanced output is the little Python3 script called enhance_json_2 (see the installation instructions in Appendix I and the code in Appendix II).

enhance_json_2 does several things in producing geo-enhanced output:

  • It reads a file of dnsdbq JSON Lines results
  • It filters out any non-public IP addresses (RFC1918 address space, IP multicast addresses, etc.)
  • It looks up the remaining IPs to get the associated geo-location data
  • It computes the relative distance from the user-supplied reference location to the geolocated points
  • It writes out the new geo-enhanced file
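Because the file written in that final step is still JSON Lines, downstream tools can process it one record at a time. As a minimal sketch, assuming the field names shown in the sample output above (enhanced_rdata, ip, city, distance), the following picks out the geolocated IP closest to the reference location for each record. The function name nearest is hypothetical.

```python
import json

# The sample enhanced record from above, as a single JSON Lines line.
SAMPLE = (
    '{"count": 519423, "time_first": "2022-12-16 12:52:32", '
    '"time_last": "2024-01-22 17:08:48", "rrname": "ucla.edu.", '
    '"rrtype": "A", "bailiwick": "ucla.edu.", '
    '"rdata": ["3.33.167.235", "15.197.181.170"], '
    '"enhanced_rdata": [{"ip": "3.33.167.235", "lat": 45.5017, '
    '"long": -73.5673, "city": "Montreal", "state": "Quebec", '
    '"country": "CA", "start_loc": "44.0521, -123.0868", '
    '"distance": 2391.82541}, {"ip": "15.197.181.170", "lat": 47.6222, '
    '"long": -122.337, "city": "Seattle (South Lake Union)", '
    '"state": "Washington", "country": "US", '
    '"start_loc": "44.0521, -123.0868", "distance": 249.29318}]}'
)


def nearest(lines):
    """Yield (rrname, ip, city, distance) for the closest IP in each record."""
    for line in lines:
        rec = json.loads(line)
        located = rec.get("enhanced_rdata", [])
        if not located:
            # Nothing could be geolocated for this record; skip it
            continue
        best = min(located, key=lambda entry: entry["distance"])
        yield rec["rrname"], best["ip"], best["city"], best["distance"]


results = list(nearest([SAMPLE]))
for rrname, ip, city, miles in results:
    print(f"{rrname} {ip} {city} {miles:.1f} miles")
```

To run this against a real file, pass an open file object (for example open("ucla.edu.enhanced.jsonl", encoding="ascii")) instead of the one-element list.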

enhance_json_2 is relatively quick: it can geo-enhance a million DNSDB results in under 25 seconds on a vanilla M1 MacBook Pro laptop.

The number of observations doesn’t get changed by enhancement, but the total size of the results in octets increases by almost 2.2X for a million results from *.edu domains.

The free geo-location data used to enhance the data is relatively coarse, but is generally sufficient for demonstrating this proof of concept. Enthusiastic users interested in greater precision may want to invest in a proprietary licensed copy of the database offering finer precision.

Conclusion

Farsight DNSDB contains a wealth of information connecting domain names to IP addresses. But sometimes you may want more contextual information about an IP address. The Farsight DNSDB toolchain currently can augment IP addresses with ASNs. In this post, we showed how you can further extend this contextual information to add geolocation information from freely available sources online using some Python code.

We hope you find the ability to add geolocation data to DNSDB API results useful. Stay tuned, because in a follow-on blog post, we’ll show how you can map the geo-located results you’ve obtained.

Appendix I. Installation Instructions

(1) Python3

https://www.python.org/downloads/

For simplicity, we’ll hardcode the Python execution path as

#!/usr/local/bin/python3

Change that if necessary. If you want to use a Python virtual environment, feel free to get that set up.

We also assume that you’ll be installing scripts to /usr/local/bin and that directory’s in your default path.

(2) Reference (base) location
Put your reference latitude and longitude coordinates in ~/.location. For example:

44.0521, -123.0868

You’ll only need to do this once, unless your reference location should ever change.
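Since a swapped or mistyped coordinate pair would silently skew every computed distance, it can be worth sanity-checking the contents of that file. The following is a minimal sketch and is not part of enhance_json_2: the function name parse_location is hypothetical, and the range checks are an addition the script itself does not perform. It accepts the same "latitude, longitude" layout shown above.

```python
def parse_location(text):
    """Parse 'lat, lon' text into a (lat, lon) tuple of floats, with checks."""
    # Accept either a comma or plain whitespace between the two values
    parts = text.replace(",", " ").split()
    if len(parts) != 2:
        raise ValueError("expected exactly two values: latitude, longitude")
    lat, lon = float(parts[0]), float(parts[1])
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude {lat} is outside -90..90")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude {lon} is outside -180..180")
    return lat, lon


print(parse_location("44.0521, -123.0868\n"))

try:
    # Latitude and longitude accidentally swapped
    parse_location("-123.0868, 44.0521")
except ValueError as err:
    print("rejected:", err)
```

Note that a swap can only be detected when the longitude value falls outside the valid latitude range, so this is a partial safeguard rather than a guarantee.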

(3) DB-IP “IP to City Lite” data in MMDB format

Get this from https://db-ip.com/db/lite.php and gunzip the downloaded file. Put the full filespec for that uncompressed database into the file ~/.geo-database-location.

For example:

$ cat ~/.geo-database-location
/Users/jsmith/dbip-city-lite-2024-01.mmdb

(4) enhance_json_2: Copy the script in Appendix II to the file enhance_json_2 using vim or your favorite text editor. Make that file executable: 

$ chmod a+rx enhance_json_2

As root, copy that file to /usr/local/bin 

$ sudo cp enhance_json_2 /usr/local/bin/

(5) Install any missing 3rd-party Python modules:

These should generally be installable with pip3 install:

geoip2 (see https://pypi.org/project/geoip2/)

pytricia (see https://github.com/jsommers/pytricia/)

unidecode (see https://pypi.org/project/Unidecode/)

Appendix II. enhance_json_2

$ cat enhance_json_2
#!/usr/local/bin/python3
""" add geotagging info to dnsdbq json lines output """

# I prefer to explicitly open and close when using multiple files
# pylint: disable=consider-using-with

# w/o the following, using the pytricia tree makes pylint complain
# pylint: disable=c-extension-no-member

# the sig handler's arguments generate a complaint otherwise
# pylint: disable=unused-argument

from collections import defaultdict
from datetime import date, datetime
from functools import partial
from ipaddress import ip_address, IPv4Network, IPv6Network
import json
import math
import os
import signal
import sys
from unidecode import unidecode
import geoip2.database

# https://github.com/jsommers/pytricia
import pytricia

# create an explicit error-reporting function per
# https://stackoverflow.com/questions/5574702/how-do-i-print-to-stderr-in-python
error = partial(print, file=sys.stderr)


# handle control-C's
# https://stackoverflow.com/questions/1112343/how-do-i-capture-sigint-in-python
def signal_handler(sig, frame):
    """ Handle Ctrl-C's w/o stackdump """
    error("Ctrl-C Interrupt Entered: Terminating")
    sys.exit(0)


signal.signal(signal.SIGINT, signal_handler)

# ipaddress.ip_address(myip).is_private would be my normal approach, but it
# apparently doesn't correctly handle some IP address ranges (100.64.0.0/10?)
# Therefore we'll just build this Patricia tree of private address space
# (defined globally to minimize function overhead)
pyt = pytricia.PyTricia(128)
pyt.insert(IPv4Network('0.0.0.0/8'), True)
pyt.insert(IPv4Network('10.0.0.0/8'), True)
pyt.insert(IPv4Network('100.64.0.0/10'), True)
pyt.insert(IPv4Network('127.0.0.0/8'), True)
pyt.insert(IPv4Network('169.254.0.0/16'), True)
pyt.insert(IPv4Network('172.16.0.0/12'), True)
pyt.insert(IPv4Network('192.0.0.0/24'), True)
pyt.insert(IPv4Network('192.0.2.0/24'), True)
pyt.insert(IPv4Network('192.31.196.0/24'), True)
pyt.insert(IPv4Network('192.52.193.0/24'), True)
pyt.insert(IPv4Network('192.88.99.0/24'), True)
pyt.insert(IPv4Network('192.168.0.0/16'), True)
pyt.insert(IPv4Network('192.175.48.0/24'), True)
pyt.insert(IPv4Network('198.18.0.0/15'), True)
pyt.insert(IPv4Network('198.51.100.0/24'), True)
pyt.insert(IPv4Network('203.0.113.0/24'), True)
pyt.insert(IPv4Network('224.0.0.0/4'), True)
pyt.insert(IPv4Network('240.0.0.0/4'), True)
pyt.insert(IPv4Network('255.255.255.255/32'), True)

pyt.insert(IPv6Network('::1/128'), True)
pyt.insert(IPv6Network('::/128'), True)
pyt.insert(IPv6Network('0:0:0:0:0:0:0:0/8'), True)
pyt.insert(IPv6Network('400:0:0:0:0:0:0:0/6'), True)
pyt.insert(IPv6Network('::ffff:0:0/96'), True)
pyt.insert(IPv6Network('64:ff9b::/96'), True)
pyt.insert(IPv6Network('64:ff9b:1::/48'), True)
pyt.insert(IPv6Network('100::/64'), True)
pyt.insert(IPv6Network('2001::/23'), True)
pyt.insert(IPv6Network('2001:db8::/32'), True)
pyt.insert(IPv6Network('2002::/16'), True)
pyt.insert(IPv6Network('2620:4f:8000::/48'), True)
pyt.insert(IPv6Network('fc00::/7'), True)
pyt.insert(IPv6Network('fe80::/10'), True)


def ok_to_try(myip3):
    """ filter out private address space and similar non-global IPs """
    can_try = True
    if pyt.get_key(myip3):
        can_try = False
    return can_try


# https://community.esri.com/t5/coordinate-reference-systems-blog/distance-on-a-sphere-the-haversine-formula/ba-p/902128
def haversine(lata, lona, latb, lonb):
    """ compute the distance between two lat,lon points """

    # Coordinates in decimal degrees (e.g. 2.89078, 12.79797)

    phi_1 = math.radians(lata)
    phi_2 = math.radians(latb)
    delta_phi = math.radians(latb - lata)
    delta_lambda = math.radians(lonb - lona)
    a = math.sin(delta_phi / 2.0) ** 2 + math.cos(phi_1) * \
        math.cos(phi_2) * math.sin(delta_lambda / 2.0) ** 2
    R = 6371000  # radius of Earth in meters
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    miles = round((R * c * 0.0006213712), 5)
    return miles


# WE USE FOUR "FILES:" RESULTS TO ENHANCE, A FILE WITH OUR REFERENCE
# LOCATION, OUR GEO-IP DATABASE FILE, AND OUR OUTPUT "FILE" (E.G. STDOUT)

# FILE #1
#
# Confirm that the results-to-enhance file was specified on the command line
# (I suppose we could use fileinput.input() instead, but I prefer this more
# bulletproof approach)
if len(sys.argv) < 2:
    error("\n****FATAL ERROR: Must specify the JSON results file as " +
          "the first positional argument to this command")
    sys.exit(1)

# handle reporting version info if requested...
# don't want to check for this until we know at least one arg's provided,
# but it must be checked before we treat that arg as a filename
if sys.argv[1] == "--version":
    print("enhance_json_2 version 1.0")
    sys.exit(0)

# Confirm that the specified results file for enhancement actually exists
myfilename = sys.argv[1]
if os.path.isfile(myfilename):
    f = open(myfilename, 'r', encoding='ascii')
else:
    error("\n****FATAL ERROR:", myfilename, "doesn't exist")
    sys.exit(1)

# FILE #2
#
# retrieve the ref ("starting") lat, lon  we need for computing distances
# it lives in the file ~/.location (create it with your favorite editor)

homedir = os.path.expanduser('~')
start_loc_file = homedir + "/.location"

if os.path.isfile(start_loc_file):
    f2 = open(start_loc_file, 'r', encoding='ascii')
    coords = f2.read().rstrip()
    two_vals = coords.replace(',', ' ').split()
    lat1 = float(two_vals[0])
    lon1 = float(two_vals[1])
    f2.close()
else:
    error("\n****FATAL ERROR: " + start_loc_file + " doesn't exist.")
    sys.exit(1)

# FILE #3:
#
# This third file points at our geo-database-file location
# If it doesn't already exist, create it with your favorite editor

dbfile_loc_file = homedir + "/.geo-database-location"

if not os.path.isfile(dbfile_loc_file):
    error("\n****FATAL ERROR: Config file with the location " +
          "of the geo-data file doesn't exist.")
    error("Create", dbfile_loc_file, "and put the filespec " +
          "of the geo-data file in it.")
    sys.exit(1)

# <a href='https://db-ip.com'>IP Geolocation by DB-IP</a>"
f3 = open(dbfile_loc_file, encoding="utf-8")
actual_filename = f3.readline().rstrip()

if not os.path.isfile(actual_filename):
    error("\n****FATAL ERROR: The geo-data file itself doesn't exist.")
    error("Download IP to City Lite in MMDB format from " +
          "https://db-ip.com/db/lite.php then ungzip that file.")
    sys.exit(1)

# make sure the geodata is relatively current, or complain
today = date.today()
modify_date = datetime.fromtimestamp(
    os.path.getmtime(actual_filename)).strftime('%Y-%m-%d')
modify_date_as_date = datetime.strptime(modify_date, "%Y-%m-%d").date()
age = (today-modify_date_as_date).days

if age > 45:
    error("\n****FATAL ERROR: Geo-data file is more than " +
          "45 days old. Time to update?")
    error("Download the updated IP to City Lite file in MMDB format from " +
          "https://db-ip.com/db/lite.php")
    error("gunzip it, then update " + dbfile_loc_file + " with that filespec")
    sys.exit(1)

#####

# Quoting https://www.geeksforgeeks.org/defaultdict-in-python/: "Defaultdict
# is a sub-class of the dictionary class that returns a dictionary-like
# object. The functionality of both dictionaries and defaultdict are almost
# same except for the fact that defaultdict never raises a KeyError. It
# provides a default value for the key that does not exist."

enhanced_rdata = defaultdict(dict)
enhanced_full_data = defaultdict(dict)

reader = geoip2.database.Reader(actual_filename)

while True:
    my_en_rdata = []
    line = f.readline()
    if not line:
        break

    # newdata is a dictionary with the regular (unenhanced) results as read-in
    try:
        newdata = json.loads(line)
    except ValueError:
        error("\n****FATAL ERROR reading JSON Lines format results as input")
        error("Does "+myfilename+" really contain JSON Lines format results?")
        sys.exit(1)

    # must explicitly wrap newdata.items() in list() or we get a RuntimeError
    # (dictionary changed size during iteration) when "enhanced_rdata" is
    # added below; list() iterates over a copy of the original items
    for key, value in list(newdata.items()):
        # we can only enhance A and AAAA records
        if key == "rrtype":
            if value not in ("A", "AAAA"):
                break
        if key == "rdata":
            # the rdata field may have multiple values, we need to process
            # all of them
            my_rdata_index = 0
            my_en_rdata = []
            # value has the rdata IPs
            for myip in value:
                # ensure we don't try to enhance a private address IP, etc.
                if ok_to_try(myip):
                    # get the geo enhancement data for the valid IP
                    response = reader.city(ip_address(myip))

                    # response object returned by reader.city will have
                    # a bunch of values including the IP address lat, lon.
                    # Round lat and long to five decimal places.
                    lat2 = float(round(response.location.latitude, 5))
                    lon2 = float(round(response.location.longitude, 5))

                    # compute the distance in miles between the start and
                    # dest locations (haversine() rounds to five decimals)
                    dist = haversine(lat1, lon1, lat2, lon2)

                    # city, state and country may have weird unicode chars
                    # unless sanitized, see https://pypi.org/project/Unidecode/
                    city = unidecode(str(response.city.name),
                                     errors='replace')
                    state = unidecode(str(response.subdivisions.
                                      most_specific.name),
                                      errors='replace')
                    country = unidecode(str(response.country.iso_code),
                                        errors='replace')

                    # eliminate embedded apostrophes to avoid quoting problems
                    city = city.replace("'", '')
                    state = state.replace("'", '')
                    country = country.replace("'", '')

                    # now create a subscripted entry for the IP as a
                    # JSON Lines entry in our dictionary
                    enhanced_rdata[my_rdata_index] = {
                        'ip': myip,
                        'lat': lat2,
                        'long': lon2,
                        'city': city,
                        'state': state,
                        'country': country,
                        'start_loc': coords,
                        'distance': dist}

                    my_en_rdata.append(enhanced_rdata[my_rdata_index])
                    my_rdata_index = my_rdata_index + 1

            # we've done all the IPs, add the enhanced Rdata
            newdata["enhanced_rdata"] = my_en_rdata
            # output the enhanced dictionary as JSON Lines format data
            print(json.dumps(newdata))

            # get ready for the next rdata to enhance, if any
            my_rdata_index = 0

# tidy up. (f2's already closed)
f.close()
f3.close()
reader.close()