
Imagine that you’ve got four or five data points for every registrable domain on the Internet, keyed by the domain name. Such a file would be over 380 million rows in length, and perhaps 10GB in size, structured as a simple sorted comma-separated values (CSV) file.
You’d like to be able to access entries in that file by key (e.g., domain name) on a Mac laptop in Python3, returning the associated data points much as you might use a Python3 dictionary for smaller data structures.
It turns out that MTBL files work great for this. This article explains how you can do that.
MTBL files are an implementation of immutable sorted string table (SST) files provided by Farsight Security, Inc. (now part of DomainTools).
If you don’t already have the MTBL package installed, see section II of “Passive DNS and SIE File Formats” for help doing so.
While MTBL files can be written and read directly using a “C” language interface, many data analysts prefer Python3. Fortunately, a Python3 interface to MTBL files is available.
Installing pymtbl requires cython. If you don’t already have cython, it can be installed using brew.
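Assuming you already have Homebrew on your Mac, something along these lines should do the trick (the formula is simply named cython):
$ brew install cython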
With libmtbl and cython installed, we should (finally!) be able to successfully install pymtbl with:
$ git clone https://github.com/farsightsec/pymtbl.git
$ cd pymtbl
$ sudo /usr/local/bin/python3 setup.py install
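As a quick sanity check that the module installed cleanly (silence means success):
$ /usr/local/bin/python3 -c "import mtbl"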
Assume our CSV file exists, is sorted, and has unique keys. For example, perhaps part of it looks like:
0--0.de,1,,,,0
0--0.dev,1,3,4,2,10
0--0.eu,1,,,,0
0--0.in,1,,,,18
0--0.jp,1,,,,32
0--0.link,1,,,,15
We’ll convert that file into an MTBL with the following code:
$ cat sample_write_mtbl.py
#!/usr/local/bin/python3
""" write a sample mtbl file """
### Run with:
### ./sample_write_mtbl.py < small-sample.csv
import os
import sys
import mtbl

mtbl_output_file = 'sample.mtbl'

# the writer won't overwrite an existing file, so remove any old copy
try:
    os.remove(mtbl_output_file)
except FileNotFoundError:
    pass

w = mtbl.writer(mtbl_output_file, compression=mtbl.COMPRESSION_ZLIB)

for line in sys.stdin:
    # split the input line into fields
    x = line.rstrip().split(',')
    # our first field (the domain name) is going to be the key
    key = x[0]
    # re-join the remaining five fields as comma-separated values
    val = ','.join(x[1:])
    # this is all we need to write the MTBL entry
    w[key] = val

# close the writer to finalize the file
w.close()
Some things to note about that code:
- MTBL writers require keys to be presented in sorted order, and each key may appear only once per file, which is why we assumed a sorted, uniquely-keyed CSV file to begin with.
- mtbl.writer() will not overwrite an existing output file, which is why we remove any old copy of the output file first.
- Keys and values are stored as byte strings; closing the writer finalizes the file on disk.
After running that command, we now have an MTBL file. But how to read it?
Reading our sample MTBL file is pretty straightforward:
$ cat sample_read_mtbl.py
#!/usr/local/bin/python3
""" read a sample mtbl file """
### Run with:
### ./sample_read_mtbl.py
import mtbl

def clean_up(list1):
    # flatten the list of returned values back into a plain CSV string
    return (str(list1).replace('[', '').replace(']', '')
            .replace("'", '').replace('"', '').replace(' ', ''))

r = mtbl.reader('sample.mtbl', verify_checksums=True)

key = "0--0.de"
print("Sample read for key=" + key)

# the key may not exist; if that's the case, return empty data
try:
    results = clean_up(r[key])
except KeyError:
    results = ",,,,,"

# the individual fields, should we need them
result_list = results.split(",")
print(results)
A sample run:
$ ./sample_read_mtbl.py
Sample read for key=0--0.de
1,,,,0
Something to note about the above code: a key lookup on an MTBL reader returns a list of matching values rather than a bare string (which is why we flatten the result with clean_up()), and a lookup on a key that isn’t present raises KeyError, which we catch and convert into an empty record.
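Because MTBL keys are stored in sorted order, a reader can also do more than exact-match lookups. Here’s a minimal sketch of range and prefix scans over the sample file built above, assuming pymtbl’s get_range() and get_prefix() wrappers:
$ cat sample_scan_mtbl.py
#!/usr/local/bin/python3
""" scan a sample mtbl file by key range and key prefix """
import mtbl

r = mtbl.reader('sample.mtbl', verify_checksums=True)

# all entries with keys between two values (inclusive)
for key, val in r.get_range('0--0.de', '0--0.in'):
    print(key, val)

# all entries whose keys share a common prefix
for key, val in r.get_prefix('0--0.'):
    print(key, val)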
Remember, we started out with the premise that we wanted to build an MTBL file for hundreds of millions of records, not just a tiny toy sample file.
The file we’re going to try loading is around 10GB:
$ ls -lh all.data.csv
[...] 10459732114 [...] all.data.csv
$ wc -l all.data.csv
393111057 all.data.csv
It doesn’t take that long to load that file on a typical laptop:
$ time ./mtbl_writer.py < all.data.csv
real 16m19.975s
user 15m59.968s
sys 0m15.097s
The resulting file is substantially (~⅔) smaller than the original:
$ ls -l all_data.mtbl
[...] 3477694979 [...] all_data.mtbl
$ mtbl_info all_data.mtbl
file name: all_data.mtbl
file size: 3,477,694,979
index block offset: 3,464,697,985
index bytes: 12,996,482 (0.37%)
data block bytes: 3,464,697,985 (99.63%)
data block size: 8,192
data block count: 908,845
entry count: 393,111,057
key bytes: 6,245,970,880
value bytes: 3,427,539,120
compression algorithm: zlib
compactness: 35.95%
What about throughput? We ran a test of 2,539,048 sample queries; those completed in 88.57 seconds, a rate of 2,539,048/88.57 ≈ 28,667 queries per second, without any optimization.
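The benchmark harness itself isn’t shown above, but a minimal sketch of how you might time bulk lookups yourself could look like the following (the MTBL file name and the keys-on-stdin input format are assumptions):
$ cat sample_benchmark_mtbl.py
#!/usr/local/bin/python3
""" time bulk key lookups against an mtbl file """
### Run with:
### ./sample_benchmark_mtbl.py < sample-keys.txt
import sys
import time
import mtbl

r = mtbl.reader('all_data.mtbl', verify_checksums=True)

# read the sample keys to query, one per line, from stdin
queries = [line.rstrip() for line in sys.stdin]

start = time.time()
hits = 0
for q in queries:
    try:
        _ = r[q]
        hits += 1
    except KeyError:
        pass
elapsed = time.time() - start

print("%d queries (%d hits) in %.2f seconds = %d queries/second"
      % (len(queries), hits, elapsed, len(queries) / elapsed))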
One thing to keep in mind about MTBL files: they’re immutable. Once you’ve created one, it cannot be “updated” without writing a new file (perhaps as part of an MTBL merge process).
In many cases, if you have new data (perhaps from a daily update cycle), the simplest solution is to create an entirely new MTBL file to replace the old one.
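If you do need to fold new data into an existing file, pymtbl exposes libmtbl’s merge support: you supply a function that decides what happens when the same key appears in more than one input. A minimal sketch, with hypothetical file names old.mtbl and new.mtbl:
$ cat sample_merge_mtbl.py
#!/usr/local/bin/python3
""" merge two mtbl files into a third """
import mtbl

def merge_func(key, val0, val1):
    # called only for keys present in both inputs; pick or
    # combine the two values however your application requires
    return val1

m = mtbl.merger(merge_func)
m.add_reader(mtbl.reader('old.mtbl'))
m.add_reader(mtbl.reader('new.mtbl'))

w = mtbl.writer('merged.mtbl', compression=mtbl.COMPRESSION_ZLIB)
m.write(w)
w.close()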
While MTBL files are in binary format and users typically access the MTBL file by querying for a specific key, you should be aware that MTBL files can be dumped to text format using the mtbl_dump command:
$ mtbl_dump all_data.mtbl
One approach to reducing the privacy risk of exhaustive data dumping would be to hash each key before writing the MTBL file, and then query the file by hashed key.
Hashing the keys will come at some cost, however, since:
- every key must be hashed at write time, and every query must hash its key the same way before doing the lookup;
- hashed keys no longer sort in domain-name order, so the input must be re-sorted by hashed key before writing, and meaningful range or prefix scans over domain names are lost;
- a digest such as SHA-256 (32 bytes binary, 64 bytes as hex) is longer than the roughly 16-byte average key in our file, so the file will grow.
For a discussion of some hashing options in Python3, see https://docs.python.org/3/library/hashlib.html
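Putting that together, here’s a minimal sketch of the hashed-key approach using SHA-256 from hashlib. Note the in-memory sort by hashed key; for a file as large as ours you’d want to sort externally instead (e.g., with sort(1)). The file names are assumptions:
$ cat sample_hashed_mtbl.py
#!/usr/local/bin/python3
""" write and read an mtbl file keyed by hashed domain names """
### Run with:
### ./sample_hashed_mtbl.py < small-sample.csv
import hashlib
import sys
import mtbl

def hashed(key):
    # printable SHA-256 digest of the key
    return hashlib.sha256(key.encode('utf-8')).hexdigest()

# hashing destroys the original sort order, so re-sort by hashed key
entries = []
for line in sys.stdin:
    x = line.rstrip().split(',')
    entries.append((hashed(x[0]), ','.join(x[1:])))
entries.sort()

w = mtbl.writer('hashed.mtbl', compression=mtbl.COMPRESSION_ZLIB)
for key, val in entries:
    w[key] = val
w.close()

# queries must hash the key the same way before the lookup
r = mtbl.reader('hashed.mtbl', verify_checksums=True)
print(r[hashed('0--0.de')])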
You’ve now seen how you can use MTBL files to efficiently provide a (read-only) Python3 dictionary-like interface to moderately large CSV files, even on typical laptop hardware. For those who are interested, there are additional examples of using pymtbl in the GitHub repository.
We’d like to thank Sean McNee for the idea for this blog and Stephen Watt and Kelvin Dealca for their very helpful comments. Any remaining issues are solely the responsibility of the author.