Python
Blog General Infosec

Efficiently Accessing a Moderately-Large Sorted and Uniquely-Keyed CSV File in Python3 with MTBL

Introduction

Imagine that you’ve got four or five data points for every registrable domain on the Internet, keyed by the domain name. Such a file would be over 380 million rows in length, and perhaps 10GB in size, structured as a simple sorted comma-separated variable (CSV) file.

You’d like to be able to access entries in that file by key (e.g., domain name) on a Mac laptop in Python3, returning the associated data points much as you might use a Python3 dictionary for smaller data structures.

It turns out that MTBL files work great for this. This article explains how you can do that.

Installing pymtbl

MTBL files are an implementation of immutable sorted string table (SST) files provided by Farsight Security, Inc. (now part of DomainTools)

If you don’t already have the MTBL package installed, see section II of “Passive DNS and SIE File Formats” for help doing so.

While MTBL files can be written and read directly using a “C” language interface, many data analysts prefer Python3. Fortunately, a Python3 interface to MTBL files is available

Installing pymtbl requires cython. If you don’t already have cython, it can be installed using brew.

With libmtbl and cython installed, we should (finally!) be able to successfully install pymtbl with:

$ git clone https://github.com/farsightsec/pymtbl.git
$ cd pymtbl
$ sudo /usr/local/bin/python3 setup.py install

Reading a Sorted CSV and Converting That into an MTBL File

Assume our CSV file exists, is sorted, and has a unique key. For example, perhaps part of it looks like:

0--0.de,1,,,,0
0--0.dev,1,3,4,2,10
0--0.eu,1,,,,0
0--0.in,1,,,,18
0--0.jp,1,,,,32
0--0.link,1,,,,15

We’ll convert that file into an MTBL with the following code:

$ cat sample_write_mtbl.py
#!/usr/local/bin/python3
""" write a sample mtbl file """

### Run with:
###    ./sample_write_mtbl.py < small-sample.csv

import os
import sys
import mtbl

mtbl_output_file='sample.mtbl'

try:
    os.remove(mtbl_output_file)
except:
    pass

w = mtbl.writer(mtbl_output_file, compression=mtbl.COMPRESSION_ZLIB)

for line in sys.stdin:
    # split the input into fields
    x = line.rstrip().split(',')
    # our first field is going to be the key
    key=x[0]
    # we'll use the remaining five fields as comma separated variables
    val= str(x[1:5]).
         replace("]","").replace("[","").
         replace("'",””).replace(" ","")
    # this is all we need to write the MTBL
    w[key]=val

Some things to note about that code:

  • We’re reading from sys.stdin, but always write to sample.mtbl.  As part of doing so, we’ll remove any EXISTING sample.mtbl file
  • We assume that the key is the domain name, in the first column, and that the file is sorted by it alphabetically
  • We’re writing our values to the MTBL file as a single string. Because we anticipate writing literally hundreds of millions of records, we minimize the amount of formatting, stripping brackets and other formatting and keeping literally just comma separated variables
  • If you prefer to write individual integer values, see https://github.com/farsightsec/pymtbl at “If you want to store integers in your mtbl use varint_encode() and varint_decode() or struct.pack and struct.unpack to do so”
  • We use zlib for MTBL compression. Other options are available, but we find zlib provides some of the best compression with reasonable efficiency/acceptable overhead

After running that command, we now have an MTBL file. But how to read it?

Reading Our Sample MTBL File

Reading our sample MTBL file is pretty straight forward:

$ cat sample_read_mtbl.py
#!/usr/local/bin/python3
""" read a sample mtbl file """

import mtbl

### Run with:
###    ./sample_read_mtbl.py

def clean_up(list1):
    return str(list1).
           replace('[','').replace(']','').
           replace("'",'').replace('"','').replace(" ","")


r = mtbl.reader('sample.mtbl', verify_checksums=True)

key="0--0.de"
print("Sample read for key="+key)

# key may not exist; if that's the case, return empty data
try:
    results = clean_up(r[key])
except:
    results = ",,,,,"

result_list = results.split(",")
print(results)

A sample run:

$ ./sample_read_mtbl.py
Sample read for key=0--0.de
1,,,,0

Something to note about the above code:

  • We verify checksums; you may elect to forgo doing so.
  • We just pull a single hard-coded value from our sample file by way of example. You’ll likely loop through a file of inputs, retrieving a set of results based on those.
  • We’re doing minimal error handling/clean up, merely handling the case of a request for a key that doesn’t exist. We didn’t take any steps to do things like ensuring keys are normalized to all lowercase, just to mention one example of an additional step we might want to take.

Does This Work at Scale?

Remember, we started out with the premise that we wanted to build a sample MTBL file for hundreds of millions of records, not just a tiny toy sample file.

The file we’re going to try loading is around 10GB:

$ ls -lh all.data.csv
[...] 10459732114 [...] all.data.csv
$ wc -l all.data.csv
393111057 all.data.csv

It doesn’t take that long to load that file on a sample laptop:

$ time ./mtbl_writer.py < all.data.csv

real    16m19.975s
user    15m59.968s
sys     0m15.097s

The resulting file is substantially (~⅔) smaller than the original:

$ ls -l all_data.mtbl
[...] 3477694979 [...] all_data.mtbl

$ mtbl_info all_data.mtbl
file name:             all_data.mtbl
file size:             3,477,694,979
index block offset:    3,464,697,985
index bytes:           12,996,482 (0.37%)
data block bytes       3,464,697,985 (99.63%)
data block size:       8,192
data block count       908,845
entry count:           393,111,057
key bytes:             6,245,970,880
value bytes:           3,427,539,120
compression algorithm: zlib
compactness:           35.95%

What about throughput? We did a test of 2,539,048 sample queries – those ran in 88.57 seconds, translating to a rate of 2,539,048/88.57=28,667 queries per second, without any optimization.

Updating the MTBL File?

One thing to keep in mind about MTBL files: they’re immutable. Once you’ve created one, it cannot be “updated” without writing a new file (perhaps as part of an MTBL merge process).

In many cases, if you have new data (perhaps from a daily update cycle), the simplest solution will be to simply create an entirely new MTBL file to replace the old one. 

Enabling Query Access with Reduced Risk of Exhaustive Data Dumping

While MTBL files are in binary format and users typically access the MTBL file by querying for a specific key, you should be aware that MTBL files can be dumped to text format using the mtbl_dump command:

$ mtbl_dump all_data.mtbl

One approach you could employ to reduce any privacy risk from exhaustive data dumping would be to hash the key before writing the MTBL file, with queries against the MTBL file based on that hashed key.

Hashing the keys will come at some cost, however, since:

  • All the hashes will need to be computed, both prior to creating the MTBL file and for any subsequent queries made against the hashed MTBL file.
  • Hashed key values may be longer than original key values, increasing the overall size of the MTBL.
  • The hashed raw data will need to be re-sorted (so the hashed keys will be in the proper ascending order).
  • The hash data may not compress as well as the original data.
  • A determined party can still use probing to extract at least some fraction of the data.

For a discussion of some hashing options in Python3, see https://docs.python.org/3/library/hashlib.html

Conclusion

You’ve now seen how you can use MTBL files to efficiently provide a (read-only) Python3 dictionary-like interface to moderately large CSV files, even on typical laptop hardware. For those who are interested, there are additional examples of using pymtbl in the Github repository

Acknowledgements

We’d like to thank Sean McNee for the idea for this blog and Stephen Watt and Kelvin Dealca for their very helpful comments. Any remaining issues are solely the responsibility of the author.