Efficiently Accessing a Moderately Large, Sorted, Uniquely-Keyed CSV File in Python3 with MTBL
Imagine that you’ve got four or five data points for every registrable domain on the Internet, keyed by the domain name. Such a file would be over 380 million rows in length, and perhaps 10GB in size, structured as a simple sorted comma-separated values (CSV) file.
You’d like to be able to access entries in that file by key (e.g., domain name) on a Mac laptop in Python3, returning the associated data points much as you might use a Python3 dictionary for smaller data structures.
It turns out that MTBL files work great for this. This article explains how you can do that.
MTBL files are an implementation of immutable sorted string table (SST) files provided by Farsight Security, Inc. (now part of DomainTools).
If you don’t already have the MTBL package installed, see section II of “Passive DNS and SIE File Formats” for help doing so.
While MTBL files can be written and read directly using a “C” language interface, many data analysts prefer Python3. Fortunately, a Python3 interface to MTBL files is available.
Installing pymtbl requires cython. If you don’t already have cython, it can be installed using brew.
With libmtbl and cython installed, we should (finally!) be able to successfully install pymtbl with:
$ git clone https://github.com/farsightsec/pymtbl.git
$ cd pymtbl
$ sudo /usr/local/bin/python3 setup.py install
Reading a Sorted CSV and Converting That into an MTBL File
Assume our CSV file exists, is sorted, and has a unique key. For example, perhaps part of it looks like:
0--0.de,1,,,,0
0--0.dev,1,3,4,2,10
0--0.eu,1,,,,0
0--0.in,1,,,,18
0--0.jp,1,,,,32
0--0.link,1,,,,15
We’ll convert that file into an MTBL with the following code:
$ cat sample_write_mtbl.py
#!/usr/local/bin/python3
""" write a sample mtbl file """

### Run with:
###    ./sample_write_mtbl.py < small-sample.csv

import os
import sys

import mtbl

mtbl_output_file = 'sample.mtbl'

# remove any existing copy of the output file
try:
    os.remove(mtbl_output_file)
except FileNotFoundError:
    pass

w = mtbl.writer(mtbl_output_file, compression=mtbl.COMPRESSION_ZLIB)

for line in sys.stdin:
    # split the input line into fields
    x = line.rstrip().split(',')
    # our first field is going to be the key
    key = x[0]
    # use the remaining five fields as comma-separated values,
    # stripping the list formatting that str() adds
    val = str(x[1:6]).replace("[", "").replace("]", "").replace("'", "").replace(" ", "")
    # this is all we need to write the MTBL
    w[key] = val
Some things to note about that code:
- We read from sys.stdin but always write to sample.mtbl. As part of doing so, we remove any existing sample.mtbl file
- We assume that the key is the domain name, in the first column, and that the file is sorted by it alphabetically
- We’re writing our values to the MTBL file as a single string. Because we anticipate writing hundreds of millions of records, we minimize the amount of formatting, stripping brackets and quotes and keeping just comma-separated values
- If you prefer to write individual integer values, see https://github.com/farsightsec/pymtbl at “If you want to store integers in your mtbl use varint_encode() and varint_decode() or struct.pack and struct.unpack to do so”
- We use zlib for MTBL compression. Other options are available, but we find zlib provides some of the best compression with reasonable efficiency/acceptable overhead
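If you do want to store the data points as binary integers rather than a CSV string, the struct-based approach the pymtbl README mentions can be sketched as follows. The function names and the use of -1 as a sentinel for empty fields are illustrative choices, not part of pymtbl:

```python
import struct

def pack_counts(values):
    """Pack five small integer data points into a fixed-size binary value.
    Empty CSV fields are stored as -1 so they round-trip unambiguously."""
    ints = [int(v) if v != '' else -1 for v in values]
    # five signed 32-bit little-endian integers
    return struct.pack('<5i', *ints)

def unpack_counts(blob):
    """Reverse of pack_counts: binary value back to CSV-style strings."""
    ints = struct.unpack('<5i', blob)
    return ['' if n == -1 else str(n) for n in ints]

row = "0--0.dev,1,3,4,2,10".split(',')
blob = pack_counts(row[1:6])
print(unpack_counts(blob))  # ['1', '3', '4', '2', '10']
```

The packed value is always 20 bytes, which may be larger or smaller than the CSV string depending on your data; measure before committing to one representation.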
After running that script, we now have an MTBL file. But how to read it?
Reading Our Sample MTBL File
Reading our sample MTBL file is pretty straightforward:
$ cat sample_read_mtbl.py
#!/usr/local/bin/python3
""" read a sample mtbl file """

### Run with:
###    ./sample_read_mtbl.py

import mtbl

def clean_up(list1):
    # collapse the reader's list-of-values result into a bare CSV string
    return str(list1).replace('[', '').replace(']', '').replace("'", '').replace('"', '').replace(' ', '')

r = mtbl.reader('sample.mtbl', verify_checksums=True)

key = "0--0.de"
print("Sample read for key=" + key)

# the key may not exist; if so, return five empty fields
try:
    results = clean_up(r[key])
except KeyError:
    results = ",,,,"

result_list = results.split(",")
print(results)
A sample run:
$ ./sample_read_mtbl.py
Sample read for key=0--0.de
1,,,,0
Something to note about the above code:
- We verify checksums; you may elect to forgo doing so.
- We just pull a single hard-coded value from our sample file by way of example. You’ll likely loop through a file of inputs, retrieving a set of results based on those.
- We’re doing minimal error handling and cleanup, handling only the case of a request for a key that doesn’t exist. We didn’t take steps such as ensuring keys are normalized to all lowercase, to mention just one additional step we might want to take.
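That normalization step might look like the following sketch. The exact rules are an assumption on our part; apply the same function at write time and at query time so keys always match:

```python
def normalize_key(domain):
    """Normalize a domain-name key the same way at write time and query time:
    strip surrounding whitespace, lowercase, and drop any trailing dot."""
    key = domain.strip().lower()
    if key.endswith('.'):
        key = key[:-1]
    return key

print(normalize_key("  0--0.DE. "))  # 0--0.de
```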
Does This Work at Scale?
Remember, we started out with the premise that we wanted to build a sample MTBL file for hundreds of millions of records, not just a tiny toy sample file.
The file we’re going to try loading is around 10GB:
$ ls -l all.data.csv
[...] 10459732114 [...] all.data.csv
$ wc -l all.data.csv
393111057 all.data.csv
It doesn’t take that long to load that file on a sample laptop:
$ time ./mtbl_writer.py < all.data.csv

real	16m19.975s
user	15m59.968s
sys	0m15.097s
The resulting file is substantially (~⅔) smaller than the original:
$ ls -l all_data.mtbl
[...] 3477694979 [...] all_data.mtbl
$ mtbl_info all_data.mtbl
file name:             all_data.mtbl
file size:             3,477,694,979
index block offset:    3,464,697,985
index bytes:           12,996,482 (0.37%)
data block bytes:      3,464,697,985 (99.63%)
data block size:       8,192
data block count:      908,845
entry count:           393,111,057
key bytes:             6,245,970,880
value bytes:           3,427,539,120
compression algorithm: zlib
compactness:           35.95%
What about throughput? We ran a test of 2,539,048 sample queries in 88.57 seconds, translating to a rate of 2,539,048/88.57 = 28,667 queries per second, without any optimization.
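If you want to reproduce that kind of measurement on your own data, a minimal harness might look like this. An ordinary dict stands in here for the mtbl.reader (both support the same r[key] lookup style); the domain names and counts are synthetic:

```python
import time

def benchmark_lookups(r, keys):
    """Time key lookups against any mapping-like object; return
    (hits, elapsed seconds, queries per second)."""
    start = time.perf_counter()
    hits = 0
    for key in keys:
        try:
            _ = r[key]
            hits += 1
        except KeyError:
            pass
    elapsed = time.perf_counter() - start
    return hits, elapsed, len(keys) / elapsed

# stand-in data; with pymtbl you would pass mtbl.reader('all_data.mtbl') instead
table = {f"domain{i}.example": "1,,,,0" for i in range(100_000)}
queries = [f"domain{i}.example" for i in range(0, 100_000, 7)]

hits, elapsed, rate = benchmark_lookups(table, queries)
print(f"{hits} hits in {elapsed:.3f}s ({rate:,.0f} queries/sec)")
```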
Updating the MTBL File?
One thing to keep in mind about MTBL files: they’re immutable. Once you’ve created one, it cannot be “updated” without writing a new file (perhaps as part of an MTBL merge process).
In many cases, if you have new data (perhaps from a daily update cycle), the simplest solution will be to simply create an entirely new MTBL file to replace the old one.
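When you do rebuild, you can swap the new file in without readers ever seeing a half-written MTBL. A sketch using only the standard library (the filenames are illustrative):

```python
import os

def install_new_mtbl(tmp_path, live_path):
    """Atomically replace the live MTBL file with a freshly written one.
    os.replace() is atomic on POSIX when both paths are on the same
    filesystem, so a reader opening live_path always sees either the
    complete old file or the complete new file."""
    os.replace(tmp_path, live_path)

# typical daily-update flow:
#   1. write the new data to 'all_data.mtbl.tmp' with mtbl.writer
#   2. install_new_mtbl('all_data.mtbl.tmp', 'all_data.mtbl')
```

Note that a process which already holds the old file open will keep reading the old data until it reopens the file.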
Enabling Query Access with Reduced Risk of Exhaustive Data Dumping
While MTBL files are in binary format and users typically access the MTBL file by querying for a specific key, you should be aware that MTBL files can be dumped to text format using the mtbl_dump command:
$ mtbl_dump all_data.mtbl
One approach you could employ to reduce any privacy risk from exhaustive data dumping would be to hash the key before writing the MTBL file, with queries against the MTBL file based on that hashed key.
Hashing the keys will come at some cost, however, since:
- All the hashes will need to be computed, both prior to creating the MTBL file and for any subsequent queries made against the hashed MTBL file.
- Hashed key values may be longer than original key values, increasing the overall size of the MTBL.
- The hashed raw data will need to be re-sorted (so the hashed keys will be in the proper ascending order).
- The hash data may not compress as well as the original data.
- A determined party can still use probing to extract at least some fraction of the data.
For a discussion of some hashing options in Python3, see https://docs.python.org/3/library/hashlib.html
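A sketch of the hashed-key approach using SHA-256, shown on a tiny in-memory sample (assuming keys are normalized to lowercase first; the re-sort mentioned above is what puts the hashed keys back into the ascending order MTBL requires):

```python
import hashlib

def hashed_key(domain):
    """Hash a normalized domain name for use as an MTBL key."""
    return hashlib.sha256(domain.lower().encode('ascii')).hexdigest()

rows = {
    "0--0.de": "1,,,,0",
    "0--0.dev": "1,3,4,2,10",
    "0--0.eu": "1,,,,0",
}

# MTBL requires keys in ascending order, so re-sort by the hashed key
hashed_rows = sorted((hashed_key(k), v) for k, v in rows.items())
for k, v in hashed_rows:
    print(k, v)

# at query time, hash the lookup key the same way:
lookup = hashed_key("0--0.DE")  # matches the stored key for 0--0.de
```

With hex-encoded SHA-256, every key becomes 64 bytes, which is why key bytes (and hence file size) can grow relative to the original short domain names.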
You’ve now seen how you can use MTBL files to efficiently provide a (read-only) Python3 dictionary-like interface to moderately large CSV files, even on typical laptop hardware. For those who are interested, there are additional examples of using pymtbl in the GitHub repository.