Adam Caudill

Security Engineer, Researcher, & Developer

Win by Building for Failure

Systems fail; it doesn’t matter what the system is. Something will fail sooner or later. When you design a system, are you focused on the happy path, or are you building with the possibility of failure in mind?

If you suffered a data breach tomorrow, what would the impact be? Does the system prevent loss by design, or does it just fall apart? Can you easily minimize loss and damage, or would an attacker have free rein once they get in? How would your customers or clients be impacted?

Going Down the Wrong (Happy) Path #

These are essential questions that should be asked when you are working on a new system, but too often, it’s the happy path that gets all the attention. That path where nothing is unusual or exceptional, where you get to show off your skills and make everyone excited. Here’s the thing about building a system that can withstand failure: most of the important work should never be noticed and never excite anyone — at least if you did the job right.

Designing for the happy path is easy; it's comfortable, and it makes for great demos and glowing reviews. But, unfortunately, it's also a path to disaster. The reality is that things do go wrong; sooner or later, something will push you off that happy path.

If you want something that can stand the test of time, and give your customers the assurances they deserve, you need to build for failure.

Defining Building for Failure #

It’s all about defense in depth and resilience; it’s designing an architecture that anticipates failure modes and includes mitigations in the core of the design (not as an afterthought to please an auditor). Great systems fail gracefully, not just in the face of common issues, but also in the face of attackers.

While I may be biased, my favorite example is one of the more critical security features that my employer (1Password) uses: the Secret Key. This is a high-entropy string that is mixed into the authentication process by the client, and thus is required to access an account or derive the encryption keys to access data. This provides a very important security property: if someone gains access to the 1Password database, they have no hope of getting into a user’s data. Because 1Password never has the Secret Key, the database is useless alone — only when you have the data, the user’s password, and the Secret Key can data be accessed.
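The idea can be sketched in a few lines. This is a simplified illustration of mixing a client-held secret into key derivation, not 1Password's actual scheme (the function names, iteration count, and construction here are my own, for illustration only):

```python
import hashlib
import hmac
import secrets

def generate_secret_key() -> str:
    # High-entropy value generated on the client and never sent to the server.
    return secrets.token_hex(16)  # 128 bits of entropy

def derive_encryption_key(password: str, secret_key: str, salt: bytes) -> bytes:
    # Stretch the (possibly weak) user password.
    stretched = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    # Mix in the Secret Key: without it, the stretched password alone is useless,
    # so a stolen server database plus a guessed password still yields nothing.
    return hmac.new(secret_key.encode(), stretched, hashlib.sha256).digest()

salt = b"per-account-salt"
sk = generate_secret_key()
k1 = derive_encryption_key("correct horse battery staple", sk, salt)
k2 = derive_encryption_key("correct horse battery staple", "wrong-secret-key", salt)
assert k1 != k2  # same password, different Secret Key: different encryption key
```

The design choice worth noting: because the Secret Key never leaves the client, an attacker who compromises the server gains nothing they can brute-force offline, no matter how weak the user's password is.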

This is building for failure; while 1Password doesn’t expect to be breached (and never has been), the system is designed so that if it ever happened, users would be protected.

How this can be implemented obviously varies depending on the system, but using additional defenses such as cryptographic controls (e.g., keys that you don’t control) can add substantial protection. In addition, ensuring you have proper monitoring, alerting, and segregation of your infrastructure can slow an attacker down or prevent them from pivoting to more valuable targets — and this has to apply as much to a rogue employee as it does to an outside attacker. A number of steps like these should be taken to ensure that if an attacker does get in, they don’t get anything valuable.

What’s important here, though, is that it requires changing the way you think about threats to your customers in an important but non-obvious way — you are always a threat to them. When you design with that in mind, seeing yourself as one of the threats, you can build more robust systems that are far more likely to withstand attacks. It’s this perspective that allows you to see with better clarity how you can defend your customers most effectively. I’m not saying that you shouldn’t trust your employees; while insider attacks do happen, they aren’t that common. I’m arguing for a change in perspective so that it doesn’t matter who the malicious actor is.

When you define a threat model — even an informal one — always include yourself as one of the threat actors. While your intentions may be pure, there could be a breach of your systems, a rogue employee, demands from a hostile government, or a thousand other things that make you a genuine threat to your customers.
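Even an informal threat model can be written down; the point of the exercise is that "us" appears on the list. A hypothetical sketch (the actors and mitigations here are illustrative, not drawn from any real product's model):

```python
# Informal threat model for a hypothetical data-storage service.
# Each actor maps to the mitigations that should hold even if that actor is malicious.
threat_model = {
    "external attacker (stolen database)": [
        "client-side encryption",
        "client-held secret required for key derivation",
    ],
    "rogue employee with production access": [
        "least privilege",
        "audit logging",
        "segregated infrastructure",
    ],
    "ourselves (breach, coercion, hostile government demands)": [
        "keys we never possess",
        "minimal data collection",
    ],
}

# Sanity check on the exercise: we must appear as one of the threat actors.
assert any("ourselves" in actor for actor in threat_model)
```

Whether it lives in a wiki page or a dict like this, the structure forces the question: what protects the customer when the actor in question is us?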

Honesty & Transparency #

During a job interview a few years ago, I was asked a question that I had never heard before; it struck me as interesting as there was only one possible answer¹.

> When is it okay to lie to a customer?

I paused for a moment to think this through; the singular answer was so obvious that I wondered if I had missed something. I replied with “never,” and then we discussed the question at length — it turns out that a surprising number of people see no problem with deceiving customers, or identify countless exceptions or justifications.

Never, ever, for any reason at all, ever lie to a customer.

There is no easier way to destroy trust, kill longstanding relationships, eliminate goodwill, and poison future prospects. Customers value honesty, even more so when it’s painful. It’s easy to lie, to keep secrets, to hide damaging information, while being honest can be quite difficult — but, as with all things in life, there’s a price to be paid for taking the easy option.

When something goes wrong, own it. Be transparent, be open, be honest, and give your customers the information they need². There may be pain in the short term, but in the long term, you earn respect for how you handle painful situations.

Blameless Remediation #

When something does go wrong, how do you respond? Is it a hunt for who to blame, or is it an opportunity to learn? Placing blame is not only ineffective, it’s actively harmful — people will protect themselves at the cost of their employer or users if they have to. They will shift blame, withhold information, hide errors, or even hide breaches when they think their job is on the line.

There are two ways to view these events:

  1. A human error, where blame is assigned, and punishment is needed.
  2. A systems failure, where learning and improvements are needed.

One of these leads to a loss of talent and a deeply hostile and dysfunctional environment, and the other leads to continuous improvement. If you are building with a long-term view, only one of these is rational.

Blame should never be placed on an individual³, but instead seen as a flaw in the system that allowed the error to progress to the point that it had an impact. If you call out those involved by name anywhere in the process, you are almost certainly doing it wrong.

While certain information is needed to ensure a proper understanding of an issue, such as the Pull Request that introduced it, the focus should not be on who was involved, but instead on the specific details of when it was introduced, whether policies were followed, and what flaw in the process allowed the issue to go unnoticed. Again, the focus should not be on the individuals, but the process.

To foster an open and effective environment, everyone needs to feel comfortable presenting the bad news; nobody should ever feel fear when something goes wrong.

It’s a learning opportunity, and one too valuable to waste.

Conclusion #

While nobody expects to suffer a painful security failure, it’s vital to build for them from the beginning. By doing this, by seeing yourself as part of the threat model, you build a system that is far more able to withstand attacks than a system built under the assumption that everything will always go right.

It’s more work to build with failure in mind, progress seems slower, designs are more complex, you spend countless hours on details that will hopefully never matter — but in the moment they do matter, they will change everything. Sometimes engineering secure systems is thankless work, but it’s vital if you care about your customers.

  1. I take a utilitarian view of the world, and try to apply this philosophy to all my actions. While automatically seeing a lie as morally wrong may seem closer to deontological than consequentialist, the long-term impact is almost always net-negative when considering second-order effects of a lie. A lie may seem to be a net-positive in the short-term as it may deflect initial negative reactions. However, in the long-term, it creates a cascading series of further deceptions, a significant risk of discovery, and a much stronger negative reaction when the deception fails — thus creating a more substantial net-negative impact than the truth possibly could. This has been your philosophy lesson for the day; thank you for attending. ↩︎

  2. As I wrote in “Responsible Disclosure Is Wrong”, the priority must always be making decisions in the best interest of the user. This is not always simple; one must balance the need to alert users against the risk of enabling more attacks. This requires a careful case-by-case analysis based on the unique factors of the event, as no two security incidents are identical. No matter what is done though, the interest of the user must always be the singular driving factor. ↩︎

  3. Some argue that those involved in an incident should be removed from their position, and excluded from future positions — this is a deeply flawed view, and one I addressed in “Best Practices vs Inane Practices” in particularly brutal form. ↩︎
