
Murphy’s Law: “Anything that can go wrong will go wrong.”
As a Data Systems Engineer, I write and maintain the systems we use to gather, process, and analyze our large datasets. Over the course of my career, I’ve written software for researchers, for small startups, for large defense companies, and for organizations in between, and I’ve learned that failure is inevitable. As DomainTools grows as an enterprise security company supporting our customers’ mission-critical infrastructure, we need to plan for failure. We need to build fault-tolerant and scalable systems.
What does that mean? It means we design systems that expect failure. What kinds of failure? All kinds of failure. Machines fail, networks have problems, and humans make mistakes. We can’t predict every type of failure, so we want to design systems that fail quickly and recover gracefully.

In fact, we can increase system robustness when we let things fail. By asking the question “When components of our systems fail, how will we recover with or without human intervention?” we begin to build resilience.
Sometimes failure is:
And when things fail, we need to be able to recover. We have data persistence, which means all our data is stored and archived. As we process our data, we keep copies of the originals, which allows us to recompute work that may have been lost during a failure. This requires more storage and may delay the availability of results, but the results are not lost.
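To make the idea of keeping originals and recomputing lost work concrete, here is a minimal sketch in Python. It is an illustration only, not a description of DomainTools’ actual systems: the function names (`archive_raw`, `process`, `recompute_missing`), the JSON-file archive, and the toy processing step are all assumptions made up for this example.

```python
import json
import tempfile
from pathlib import Path


def archive_raw(record_id: str, raw: dict, archive_dir: Path) -> Path:
    """Persist an untouched copy of the input before any processing (hypothetical helper)."""
    path = archive_dir / f"{record_id}.json"
    path.write_text(json.dumps(raw))
    return path


def process(raw: dict) -> dict:
    """Stand-in for a real processing step; the fields here are invented for illustration."""
    return {"domain": raw["domain"].lower(), "length": len(raw["domain"])}


def recompute_missing(archive_dir: Path, results: dict) -> dict:
    """Rebuild any result that was lost by replaying the archived originals."""
    for path in sorted(archive_dir.glob("*.json")):
        record_id = path.stem
        if record_id not in results:
            results[record_id] = process(json.loads(path.read_text()))
    return results


def demo() -> dict:
    """Archive two records, lose one result, then recover it from the archive."""
    with tempfile.TemporaryDirectory() as tmp:
        archive_dir = Path(tmp)
        results = {}
        inputs = [("a", {"domain": "Example.COM"}), ("b", {"domain": "DomainTools.com"})]
        for record_id, raw in inputs:
            archive_raw(record_id, raw, archive_dir)
            results[record_id] = process(raw)
        # Simulate a failure that loses one computed result.
        del results["b"]
        return recompute_missing(archive_dir, results)
```

Calling `demo()` returns results for both records even though one was dropped mid-run. The trade-off is the one described above: the archive takes extra storage, and replaying it takes time, but nothing is permanently lost.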
This is just one of the many ways we work to make sure you have access to the DomainTools data and services you have come to rely on.