In the Shadow of Murphy’s Law: Designing for Failure

Published on: Nov 21, 2017

Murphy’s Law: “Anything that can go wrong will go wrong.”

As a Data Systems Engineer, I write and maintain the systems we use to gather, process, and analyze our large datasets. Over the course of my career, I’ve written software for researchers, for small startups, for large defense companies, and for organizations in between, and I’ve learned that failure is inevitable. As DomainTools grows as an enterprise security company supporting our customers’ mission-critical infrastructure, we need to plan for failure. We need to build fault-tolerant and scalable systems.

What does that mean? It means we design systems that expect failure. What kinds of failure? All kinds of failure. Machines fail, network problems exist, and humans make mistakes. The idea is that we can’t predict all the types of failure, so we want to design systems that fail quickly and recover gracefully.

In fact, we can increase system robustness when we let things fail. By asking the question “When components of our systems fail, how will we recover with or without human intervention?” we begin to build resilience.

Sometimes failure is:

Driven by success. An increase in traffic to a website can bring down a service. We plan for this by serving our website behind a load balancer. This distributes the load to multiple servers and decouples the server from the client. We can add or remove servers to adapt to changes in load and to update services without reducing availability.
Outside our control. We have designed redundancy into our networking. Multiple carriers provide our internet access, which gives us alternate routes when a network fails. Our switches are set up to fail over automatically to a secondary route when the primary route is congested, which shortens disruptions.
Temporary. Our services are written with retry logic. When a remote call fails, our code waits and automatically tries again. We have added timeout logic, so if we don’t receive an answer in a timely manner, we move on to plan B.
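
The retry-and-timeout idea in the last item can be sketched in a few lines of Python. This is a minimal illustration, not our production code: the names call_with_retry and RemoteCallError, the exponential wait between attempts, and the default values are all assumptions made for the example. The primary callable is assumed to accept a timeout in seconds and to raise when it fails or runs out of time.

```python
import time


class RemoteCallError(Exception):
    """Hypothetical error type standing in for a failed remote call."""


def call_with_retry(primary, fallback, attempts=3, base_delay=0.5, budget=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Try `primary` a few times within a time budget, then fall back to plan B.

    `primary` is assumed to take a `timeout` keyword (seconds) and to raise
    RemoteCallError or TimeoutError on failure. `fallback` takes no arguments.
    """
    deadline = clock() + budget
    for attempt in range(attempts):
        remaining = deadline - clock()
        if remaining <= 0:
            break  # the overall time budget is spent
        try:
            return primary(timeout=remaining)
        except (RemoteCallError, TimeoutError):
            if attempt == attempts - 1:
                break  # no attempts left
            # Wait longer after each failure, but never past the deadline.
            wait = base_delay * (2 ** attempt)
            sleep(min(wait, max(deadline - clock(), 0.0)))
    return fallback()
```

A caller would pass its real remote call as primary and something cheap and local as fallback, such as a cached answer. Doubling the wait after each failure is one common choice because it gives a struggling service room to recover; a fixed wait would also fit the description above.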

And when things fail, we need to be able to recover. We have data persistence, which means all our data is stored and archived. As we process our data, we keep copies of the originals, which allows us to recompute work that may have been lost during a failure. This requires more storage and may delay the availability of results, but the results are not lost.
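
As a rough sketch of that recovery step, assume the originals and the computed results live in two simple key-value stores. The function names and the word-count "processing" below are hypothetical stand-ins chosen for the example, not a description of our actual pipeline.

```python
def process(record: str) -> int:
    """Hypothetical processing step: here it only counts words."""
    return len(record.split())


def recover(originals: dict, results: dict) -> list:
    """Recompute every result that is missing, using the stored originals.

    Returns the keys that had to be recomputed, in the order they were found.
    """
    recomputed = []
    for key, record in originals.items():
        if key not in results:
            # The original was kept, so the lost work can be redone.
            results[key] = process(record)
            recomputed.append(key)
    return recomputed
```

Because the originals are never modified, running recover a second time changes nothing, which makes it safe to run after any failure without first working out exactly what was lost.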

These are just some of the many ways we work to make sure you have access to the data and services from DomainTools you have come to rely on.