Win by Building for Failure

Systems fail; it doesn’t matter what the system is. Something will fail sooner or later. When you design a system, are you focused on the happy path, or are you building with the possibility of failure in mind?

If you suffered a data breach tomorrow, what would the impact be? Does the system prevent loss by design, or does it just fall apart? Can you easily minimize loss and damage, or would an attacker have free rein once they get in? How would your customers or clients be impacted?

Going Down the Wrong (Happy) Path

These are essential questions that should be asked when you are working on a new system, but too often, it’s the happy path that gets all the attention. That path where nothing is unusual or exceptional, where you get to show off your skills and make everyone excited. Here’s the thing about building a system that can withstand failure: most of the important work should never be noticed and never excite anyone — at least if you did the job right.

Designing for the happy path is easy and comfortable, and it makes for great demos and glowing reviews. Unfortunately, it’s also a path to disaster. The reality is that things do go wrong; sooner or later, something will push you off that happy path.

If you want something that can stand the test of time, and give your customers the assurances they deserve, you need to build for failure.

Defining Building for Failure

It’s all about defense in depth and resilience; it’s designing an architecture that anticipates failure modes and includes mitigations in the core of the design (not as an afterthought to please an auditor). Great systems fail gracefully, not just in the face of common issues, but also in the face of attackers.

While I may be biased, my favorite example is one of the more critical security features that my employer (1Password) uses: the Secret Key. This is a high-entropy string that is mixed into the authentication process by the client, and thus is required to access an account or derive the encryption keys to access data. This provides a very important security property: if someone gains access to the 1Password database, they have no hope of getting into a user’s data. Because 1Password never has the Secret Key, the database is useless alone — only when you have the data, the user’s password, and the Secret Key can data be accessed.

This is building for failure; while 1Password doesn’t expect to be breached (and never has been), the system is designed so that if it ever happened, users would be protected.
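
To make the idea of a client-held secret concrete, here is a minimal Python sketch of mixing a high-entropy secret into key derivation. It is not 1Password’s actual scheme; the function name, parameters, iteration count, and the choice of PBKDF2 followed by HMAC are all assumptions made for illustration.

```python
import hashlib
import hmac
import secrets


def derive_account_key(password: str, secret_key: bytes, salt: bytes,
                       iterations: int = 100_000) -> bytes:
    """Derive a 32-byte key that needs BOTH the password and the client-held secret.

    Hypothetical helper for illustration only; not a real product's algorithm.
    """
    # Stretch the (potentially weak) password so each guess is expensive.
    stretched = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    # Mix in the high-entropy secret. It is used as the HMAC key, so without it
    # the output cannot be computed, no matter how many passwords are guessed.
    return hmac.new(secret_key, stretched, hashlib.sha256).digest()


# Generated on the client and never sent to the server (illustrative values).
secret_key = secrets.token_bytes(16)
# The salt is not secret; the server can store it next to the encrypted data.
salt = secrets.token_bytes(16)

key = derive_account_key("correct horse battery staple", secret_key, salt)

# An attacker who holds the database and even the correct password, but not
# the user's secret, derives a different key.
attacker_key = derive_account_key("correct horse battery staple",
                                  secrets.token_bytes(16), salt)
```

In this sketch the server would only ever store the salt and the encrypted data. A stolen database therefore does not let an attacker succeed by guessing passwords offline, because every guess also requires the 128-bit secret that never left the client.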

How this can be implemented obviously varies depending on the system, but additional defenses such as cryptographic controls (i.e., keys that you don’t control) can add substantial protection. In addition, proper monitoring, alerting, and segregation of your infrastructure can slow an attacker down or prevent them from pivoting to more valuable targets, and this has to apply as much to a rogue employee as it does to an outside attacker. Taken together, steps like these help ensure that an attacker who does get in doesn’t get anything valuable.

What’s important here, though, is that this requires changing the way you think about threats to your customers in a significant but non-obvious way: you are always a threat to them. When you design with that in mind, seeing yourself as one of the threats, you can build more robust systems that are far more likely to withstand attacks. It’s this perspective that lets you see more clearly how to defend your customers most effectively. I’m not saying that you shouldn’t trust your employees; while insider attacks do happen, they aren’t that common. I’m arguing for a change in perspective so that it doesn’t matter who the malicious actor is.

When you define a threat model — even an informal one — always include yourself as one of the threat actors. While your intentions may be pure, there could be a breach of your systems, a rogue employee, demands from a hostile government, or a thousand other things that make you a genuine threat to your customers.

Honesty & Transparency

During a job interview a few years ago, I was asked a question that I had never heard before; it struck me as interesting as there was only one possible answer¹.

When is it okay to lie to a customer?

I paused for a moment to think this through; the singular answer was so obvious that I wondered if I had missed something. I replied with “never,” and then we discussed the question at length — it turns out that a surprising number of people see no problem with deceiving customers, or identify countless exceptions or justifications.

Never, ever, for any reason at all, ever lie to a customer.

There is no easier way to destroy trust, kill longstanding relationships, eliminate goodwill, and poison future prospects. Customers value honesty, and even more so when it’s painful. It’s easy to lie, to keep secrets, to hide damaging information, while being honest can be quite difficult; but, as with all things in life, there’s a price to be paid for taking the easy option.

When something goes wrong, own it. Be transparent, be open, be honest, and give your customers the information they need². There may be pain in the short term, but in the long term, you earn respect for how you handle painful situations.

Blameless Remediation

When something does go wrong, how do you respond? Is it a hunt for who to blame, or is it an opportunity to learn? Placing blame is not only ineffective, it’s actively harmful — people will protect themselves at the cost of their employer or users if they have to. They will shift blame, withhold information, hide errors, or even hide breaches when they think their job is on the line.

There are two ways to view these events:

A human error, where blame is assigned, and punishment is needed.
A systems failure, where learning and improvements are needed.

One of these leads to a loss of talent and a deeply hostile and dysfunctional environment, and the other leads to continuous improvement. If you are building with a long-term view, only one of these is rational.

Blame should never be placed on an individual³; instead, the error should be seen as a flaw in the system that allowed it to progress to the point that it had an impact. If you call out those involved by name anywhere in the process, you are almost certainly doing it wrong.

While certain information is needed to ensure a proper understanding of an issue, such as the Pull Request that introduced it, the focus should not be on who was involved, but instead on the specific details: when it was introduced, whether policies were followed, and what gap allowed the issue to go unnoticed. Again, the focus should be on the process, not the individuals.

To foster an open and effective environment, everyone needs to feel comfortable presenting the bad news; nobody should ever feel fear when something goes wrong.

It’s a learning opportunity, and one too valuable to waste.

Conclusion

While nobody expects to suffer a painful security failure, it’s vital to build for one from the beginning. By doing this, by seeing yourself as part of the threat model, you build a system that is far more able to withstand attacks than a system built under the assumption that everything will always go right.

It’s more work to build with failure in mind: progress seems slower, designs are more complex, and you spend countless hours on details that will hopefully never matter. But in the moment they do matter, they will change everything. Sometimes engineering secure systems is thankless work, but it’s vital if you care about your customers.

¹ I take a utilitarian view of the world, and try to apply this philosophy to all my actions. While automatically seeing a lie as morally wrong may seem closer to deontological than consequentialist, the long-term impact is almost always net-negative when considering second-order effects of a lie. A lie may seem to be a net-positive in the short-term as it may deflect initial negative reactions. However, in the long-term, it creates a cascading series of further deceptions, a significant risk of discovery, and a much stronger negative reaction when the deception fails, thus creating a more substantial net-negative impact than the truth possibly could. This has been your philosophy lesson for the day; thank you for attending.

² As I wrote in “Responsible Disclosure Is Wrong”, the priority must always be making decisions in the best interest of the user. This is not always simple; one must balance the need to alert users, and the risk of enabling more attacks. This requires a careful case-by-case analysis based on the unique factors of the event, as no two security incidents are identical. No matter what is done though, the interest of the user must always be the singular driving factor.

³ Some argue that those involved in an incident should be removed from their position, and excluded from future positions; this is a deeply flawed view, and one I addressed in “Best Practices vs Inane Practices” in particularly brutal form.

Adam Caudill
