114. Beyond Root Cause Analysis in Complex Systems

DevLife
April 27th, 2021
Episode 114
26:28

Hosted by Marcus Blankenship, Robert Blumen

In this episode of Code[ish], Marcus Blankenship, a Senior Engineering Manager at Salesforce, is joined by Robert Blumen, a Lead DevOps Engineer at Salesforce.

During their discussion, they take a deep dive into the theories that underpin human error and complex system failures and offer fresh perspectives on improving complex systems.

Root cause analysis is the method of analyzing a failure after it occurs in an attempt to identify the cause. This method looks at the fundamental reasons that a failure occurs, particularly digging into issues such as processes, systems, designs, and chains of events. Complex system failures usually begin when a single component of the system fails, requiring nearby “nodes” (or other components in the system network) to take up the workload or obligation of the failed component.

Complex system breakdowns are not limited to IT. They also occur in medicine, industrial accidents, shipping, and aeronautics. As Robert asserts: “In the case of IT, [systems breakdowns] mean people can’t check their email, or can’t obtain services from a business. In other fields, like medicine, maybe the patient dies. A ship capsizes, a plane crashes.”

The 5 WHYs

The 5 WHYs root cause analysis is about truly getting to the bottom of a problem by asking “why” five levels deep. Using this method often uncovers an unexpected internal or process-related problem.

Accident investigations can involve both simple and complex systems. Robert explains, “Simple systems are like five dominoes that have a knock-on effect. By comparison, complex systems have a large number of heterogeneous pieces. And the interaction between the pieces is also quite complex. If you have N pieces in an IT system, you could have N squared connections between them.”

He further explains, “You can lose a server, but if you’re properly configured to have retries, your next level upstream should be able to find a different service. That’s a pretty complex interaction that you’ve set up to avoid an outage.”

In the case of a complex system, generally, there is not a single root cause for the failure. Instead, it’s a combination of emergent properties that manifest themselves as the result of various system components working together, not as a property of any individual component.

An example of this is the worst airline disaster in history. Two 747 planes were flying to Gran Canaria airport. However, the airport was closed due to an exploded bomb, and the planes were rerouted to Tenerife. The airport in Tenerife was not used to handling 747s. Inadequate radar and fog compounded a combination of human errors, such as misheard commands. Two planes tried to take off at the same time and collided with each other in the air.

Robert talks about Dr. Cook, who wrote about the dual role of operators.
“The dual role is the need to preserve the operation of the system and the health of the business. Everything an operator does is with those two objectives in mind.” They must take calculated risks to preserve outputs, but this is rarely recognized or complimented.

Another component of complex systems is that they are in a perpetual state of partially broken. You don’t necessarily discover this until an outage occurs. Only through the post-mortem process do you realize there was a failure. Humans are imperfect beings and are naturally prone to making errors. And when we are given responsibilities, there is always the chance for error.

What’s a more useful way of thinking about the causes of failures in a complex system?

Robert gives the example of a tree structure or acyclic graph with one node at the edge, representing the outage or incident.

If you step back one layer, you ask not what the cause is, but what the contributing causes were. In this manner, you might find multiple contributing factors that interconnect as the graph grows. With this understanding, you can then look at the system and say, “Well, where are the things that we want to fix?”
It’s important to remember that if you find 15 contributing factors, you are not obligated to fix all 15; only three or four of them may be important. Furthermore, it may not be cost-effective to fix everything.

One approach is to take all of the identified contributing factors, rank them by some combination of their impact and costs, then decide which are the most important.

What is some advice for people who want to stop thinking about their system in terms of simple systems and start thinking about them in terms of complex systems?

Robert Blumen suggests being aware that you may have a cognitive bias toward focusing on the portions of the system where a person made a decision. When human error appears as a contributing cause, he recommends asking:

What was the context that that person was facing at the time?
Did they have enough information to make a good decision?
Are we putting people in impossible situations where they don’t have the right information?
Was there adequate monitoring? If this was a known problem, was there a runbook?
What are ways to improve the human environment so that the operator can make better decisions if the same set of factors occurs again?

Show Notes

Speaker 1:
Hello and welcome to Code[ish], an exploration of the lives of modern developers. Join us as we dive into topics like languages and frameworks, data and event-driven architectures, and individual and team productivity, all tailored to developers and engineering leaders. This episode is part of our tools and tips series.

Marcus Blankenship:
Welcome to this episode of Code[ish]. I'm Marcus Blankenship, a Senior Engineering Manager at Salesforce. And today my guest is Robert Blumen, a Lead DevOps Engineer at Salesforce. Welcome Robert.

Robert Blumen:
Thanks Marcus.

Marcus Blankenship:
I'm really excited about this episode because it's near and dear to my heart. We are going to be talking about alternatives to root cause analysis, especially when problems happen and things go wrong. We're going to discuss common root cause analysis formats, and why they aren't the best way to go about thinking about complex system failures. And we're going to end with some thoughts about better ways to think about how to improve complex systems. So Robert, what is a root cause analysis?

Robert Blumen:
Okay, so root cause analysis would be the different methods that people have of analyzing a failure after the fact to identify the cause. It may differ across domains. This is not only something we face in IT. As I looked into the literature about this, there are people in many different fields, like medicine, industrial accidents, shipping, and aeronautics, where you have what we call an incident or a failure: something bad happened, something you didn't want.

Robert Blumen:
In the case of IT, it means people can't check their email or they can't obtain services from a business. In other fields, like medicine, maybe the patient dies. A ship capsizes, a plane crashes.

Robert Blumen:
You're talking about very serious outages or failures that can even result in loss of life. The assumption here is that the world is governed by laws of cause and effect. So if you understand the cause and effect that led up to this failure, and if there is such a thing as the root cause, then you'd have an idea of how to prevent this thing from happening again.

Robert Blumen:
How do we make some changes? You need to go through that analysis process and find out what the cause or causes were that led to that accident.

Marcus Blankenship:
Well, that sounds pretty reasonable. There's a lot of things in my life that are cause and effect. So what kind of steps might somebody today take when they're doing a root cause analysis? Are there some forms that are popular?

Robert Blumen:
One that is popular in IT is something called the Five Whys. The idea is that you ask, "Why did this incident happen?" You ask five times, and the fifth answer is the root cause.

Robert Blumen:
This does make a certain amount of sense because, let's say we had an outage in our system. So why did this happen? The first thing we find is that one of the servers went down. And clearly, that is relevant. But then you might say, "Well, aren't we supposed to have multiple servers so we can handle the load if one of them goes down? It would shift the load."

Robert Blumen:
So clearly, just asking, "Well, why did this happen?" and taking the first thing you say is probably not a full understanding of why your outage happened. The idea of the Five Whys is that if you go back five levels deep, then you will find something called the root cause.

Marcus Blankenship:
Is there anything special about the number five?

Robert Blumen:
No, it's completely arbitrary and that's one of the problems with this method.

Marcus Blankenship:
So I can see that would lead us to a deeper understanding. Why is it not a helpful way to reason about complex systems?

Robert Blumen:
One of the researchers in this field, Dr. Erik Hollnagel, has a great slide in one of his slide decks on the pervasiveness of the row-of-dominoes metaphor in media coverage and in people's writing about accidents.

Robert Blumen:
The idea is that a system is like five dominoes. The way that failures occur is that one domino falls over and knocks over the subsequent four dominoes, the fifth domino being the point where it becomes visible to the user or the customer. So if you walk five dominoes back, you find the first one and you're done.

Robert Blumen:
So all of the industries and the domains where people are concerned about accidents are what are called complex systems. As I delved into the literature around this, which is broadly known as the new view of human error, I found this distinction between simple and complex systems. Simple systems are like five dominoes.

Robert Blumen:
One, two, three, four, five, they fall. In complex systems, you have a large number of heterogeneous pieces. And the interaction between the pieces is also quite complex. If you have N pieces in an IT system, you could have N squared connections between them.

Robert Blumen:
You could have N squared connections, but across each connection, you could have many different protocols. And a lot of the behaviors that we're interested in are really emergent properties of the system.
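
To put rough numbers on the N squared point, here is a minimal sketch in Python. It assumes every piece could talk to every other piece, which is an upper bound rather than a description of any real system, and the function name is purely illustrative.

```python
def max_connections(n: int, directed: bool = True) -> int:
    """Upper bound on pairwise connections among n components.

    Assumes any component may connect to any other component.
    """
    ordered_pairs = n * (n - 1)  # "a calls b" counted separately from "b calls a"
    return ordered_pairs if directed else ordered_pairs // 2


# The count grows roughly with the square of the number of pieces.
for n in (5, 50, 500):
    print(n, max_connections(n))
```

Each of those connections can also carry several protocols, so the number of distinct interactions may be larger still.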

Robert Blumen:
You can lose a server, but if you're properly configured to have retries and round robin, your next level upstream should be able to find a different service. That's a pretty complex interaction that you've set up to avoid an outage.
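
The retry and round-robin behavior described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the backend names, the `send` callable, and the choice of `ConnectionError` as the failure signal are all hypothetical and not taken from any particular load balancer.

```python
from itertools import cycle
from typing import Callable, Sequence


def call_with_retries(
    backends: Sequence[str],
    send: Callable[[str], str],
    max_attempts: int = 3,
) -> str:
    """Try backends in round-robin order until one succeeds."""
    if not backends:
        raise ValueError("at least one backend is required")
    rotation = cycle(backends)  # endless round-robin over the backend list
    last_error = None
    for _ in range(max_attempts):
        backend = next(rotation)
        try:
            return send(backend)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and move to the next backend
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error


def fake_send(backend: str) -> str:
    """Hypothetical transport: server-a is down, every other backend works."""
    if backend == "server-a":
        raise ConnectionError(f"{backend} is down")
    return f"handled by {backend}"


print(call_with_retries(["server-a", "server-b", "server-c"], fake_send))
```

Even in this toy version, whether the caller sees an outage depends on the interaction of the retry budget, the rotation order, and which servers happen to be down, not on any single server.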

Robert Blumen:
Now, on the difference between simple and complex systems, I'm going to quote one of the researchers in the field, Kevin Heselin. He said simple systems fail in simple ways. Complex systems fail in complex ways.

Robert Blumen:
In the case of a complex system, generally, there is not one thing that was the root cause. For a complex system to fail, all of the defenses and retries and redundancy you built in, for some reason, did not work.

Robert Blumen:
In order to understand what went wrong, generally you'll surface multiple things. And all of those things had to happen all at once to have a failure.

Robert Blumen:
The idea is that multiple jointly contributing causes are the explanation of the failure, not one single root cause. If you focus on the one root cause, you are missing a lot of these other jointly contributing causes that you need for a realistic understanding of why the failure occurred.

Marcus Blankenship:
So if we go back to simple systems for a moment, I'm imagining the line of dominoes, but any single line of dominoes, whether five or 50 or 500 is linear.

Marcus Blankenship:
So therefore it is a simple system. So we don't have to necessarily think that the system is small to be simple, it can be big, but it needs to exist in a certain configuration.

Marcus Blankenship:
It's dominoes, where one thing leads to another. But complex systems sound fundamentally different in this way. We've got so many different variables that just asking why five times isn't going to contribute to our understanding in a meaningful way. I'd love to hear an example of some failure of a complex system. Do you have any?

Robert Blumen:
There are a bunch, and some of the interesting ones are not in the information technology world. One of the more well-known examples that has been studied is the deadliest air traffic incident in history.

Robert Blumen:
It occurred on an island somewhere off Europe called Tenerife. It was just some incredibly bad luck, a whole number of things which all happened at once. It started with two 747s that were not supposed to land there, but due to some kind of weather conditions, they both got rerouted to the same airport.

Robert Blumen:
The runway wasn't typically used to handling 747s. There was not normal 747 traffic there. So the air traffic controllers didn't have a great idea of how to guide those planes. There was some bad weather condition, but it gets way worse from there. The airport did not have a proper kind of radar that was used to guide these more modern planes.

Robert Blumen:
There were these two 747s on the tarmac at once. There were some misunderstood commands between air traffic control and the cockpits of the two planes.

Robert Blumen:
For some reason, some kind of fail-safe failed, and the pilots missed cutoffs in their route. The end of the story is that the two planes both tried to take off at the same time and collided with each other in the air.

Robert Blumen:
I hope that explanation gives you an idea of how many different things have to go wrong all at once when you have a problem domain like air traffic control, where there are so many built-in fail-safes.

Robert Blumen:
I'm not an expert in this, but I understand there are a lot of protocols between the cockpit and the controller to ensure that the instructions are properly understood. So that had to fail for this accident to happen.

Marcus Blankenship:
So there were many contributing factors. You listed a whole bunch: the weather, the fact that the airport wasn't meant to handle it, human error, insufficient radar technology. So there were all these factors, and had any one of them not been there, the outcome might've been different.

Robert Blumen:
That's absolutely right Marcus.

Marcus Blankenship:
I'm curious, you used the term emergent properties. Complex systems have emergent properties. Could you tell us what that means?

Robert Blumen:
It's a property of the system as a whole that is not a property of any one particular part of the system. One great example of that comes from economics, where the market price depends on the marginal buyer and the marginal seller, and then the supramarginal and submarginal buyers and sellers.

Robert Blumen:
In order to identify who those are, you need to look at the entire market and identify the bid and ask of every participant in order to identify which ones are marginal.

Robert Blumen:
Another example would be something we're concerned about in IT, which is the availability of a system. We are used to now building systems out of unreliable components.

Robert Blumen:
We know that servers can go down and yet the availability of the entire system can still be much greater than the reliability of any single component.
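
A small worked example makes the availability point concrete. The sketch below uses the standard formula for redundant components in parallel, 1 - (1 - a)^n, and it assumes that replicas fail independently of one another. That assumption is a simplification that real systems often violate, and the 99% figure is illustrative rather than taken from the episode.

```python
def parallel_availability(component_availability: float, replicas: int) -> float:
    """Availability of a service that is up whenever at least one replica is up.

    Assumes replicas fail independently, which is an idealization.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("at least one replica is required")
    all_down = (1.0 - component_availability) ** replicas  # every replica down at once
    return 1.0 - all_down


for replicas in (1, 2, 3):
    print(replicas, f"{parallel_availability(0.99, replicas):.6f}")
```

Under these assumptions, two servers that are each up 99% of the time give a service that is up about 99.99% of the time. That figure belongs to the arrangement as a whole and not to either server, which is what makes it an emergent property.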

Marcus Blankenship:
Interaction of components and those properties versus just how one component acts. If we go back to that airplane accident example, I feel like a traditional view of a problem like this, or a situation like this would be to start holding people accountable. To blame the pilot. I mean, they're the ones who were in control.

Marcus Blankenship:
Is that still a useful way to think about these kinds of problems?

Robert Blumen:
Generally not. That particular point you're bringing up is very much emphasized in this literature.

Robert Blumen:
As I mentioned, it's called the new view of human error. The analysis of what went wrong involves many components. And some of those components are decisions or actions made by a human. Humans do make errors, but as we've been discussing, the human role in the outcome is one of multiple contributing factors which led to the final result.

Robert Blumen:
It's not the sole cause. Some of the researchers in this field have suggested that there is a cognitive bias people have. Suppose I presented to you three or four different things that happened.

Robert Blumen:
Some hardware failed, there was bad weather, a plane got landed in the wrong airport, and the air traffic controller screwed up. If you were asked which is the cause, people are more inclined to focus on the human error, even though it may have been equally as important as the other things, but not any more or less important.

Robert Blumen:
But there is a little bit of a deeper answer to that. One of the landmark papers in this area is by a Dr. Cook, who is an MD. It's a fantastic paper about complex systems.

Robert Blumen:
He talks about something he calls the dual role of operators. Now, operators is a general term, to me, for the people who try to cover for the fact that something went wrong, and in the end systems depend on people.

Robert Blumen:
We may set up rules, like we have a cluster manager. If it sees a server that's unhealthy, it will pull it out and put a new one in. And to a certain extent we do trust rules, but to really keep a system up, you need the human operator who can look at something and say, "I don't think the rules we set up are working the way they're supposed to."

Robert Blumen:
There is a whole literature around the avoidance of nuclear war, which has happened a number of times because something on a screen showed a nuclear attack was coming in, and somebody whose job was to launch the counterstrike said, "I don't think that's a nuclear attack." And it turned out to be a flock of birds. So it's important that we have creative, smart people who are problem solvers to have a check on automation and the rules that we've built into systems.

Robert Blumen:
Now, what Dr. Cook talks about is this concept called the dual role of operators. The dual role is that they need to preserve the operation of the system. We're a business. We need to keep the business up and running.

Robert Blumen:
So the customers can obtain the valuable services that they've paid for. And we need to avoid errors. Everything an operator does is with those two objectives in mind, and everything an operator does is a calculated risk, because any change you make might succeed, meaning preserving outputs, or it might fail, which could mean making the system worse or causing an outage.

Robert Blumen:
And this is another one of these cognitive biases that the researchers have identified. It concerns the time when the [inaudible] is making a decision, taking a calculated risk to preserve outputs.

Robert Blumen:
They don't tend to get credit for that. But in those cases where they take a risk and they fail, we're very quick to say, "Hey Marcus, what were you thinking? Did you not know you were going to crash the entire system if you changed that configuration?" But what about the nine out of 10 times where you made a bunch of really great decisions and kept the system up?

Robert Blumen:
We don't have a post-mortem and say, "Let's look at how great a job Marcus did, the last nine times he was on call, at keeping the system up." There's maybe a little bit of unfairness after the fact, when we have more information, in pointing at the person and saying you made an error.

Robert Blumen:
And maybe you did, but that's not really taking into account the full complexity of the situation, which is that there's a lot going on, you were doing the best you could, and it is your job to take calculated risks.

Robert Blumen:
And the reason that you were in that position, where you had to take a calculated risk is because other things were going wrong and you were trying to stop it.

Marcus Blankenship:
I feel like if this was a basketball metaphor, we would be criticizing the player that missed 1% of the shots rather than celebrating the fact that they made 99% of the shots and made the points there.

Robert Blumen:
Sure. Well, if you're in a game that is lost by one point, there were 100 other plays where somebody either made a basket or did not make a basket to get to that point where the score was tied. And so you can't just blame the one guy who missed the one shot at the last second of the game.

Marcus Blankenship:
That's a great point. There's something else you said earlier that I hear a lot, and I think it's a place where the question really matters, because you used the question hypothetically.

Marcus Blankenship:
"Well, what was the cause of that problem?" Just that phrase, "the cause," is very singularly focused. It implies there must be one cause, and one cause only, that we have to go identify.

Robert Blumen:
We can understand the simple systems, the five dominoes. One of the properties of complex systems is that no one can fully understand them. The failures tend to occur because of a cascading series of failures that no one had thought of.

Robert Blumen:
If you had thought, "Well, A, B, and C could all go wrong at once, and that would be a failure," you might have put in some mitigation.

Robert Blumen:
So if those three things happened, it would not fail. Other times you would say the chance of all those things happening all at once is so remote that it's not cost-effective to mitigate it. We'll live with that.

Robert Blumen:
That would be a business decision. Every system has an SLA, and with the SLAs in the industry, nobody strives for 100%.

Robert Blumen:
It's not achievable in our business. Perhaps in some of these other fields where human life is at stake, they may be striving for 100%. In our business, it would not be cost-effective and maybe not even possible.

Marcus Blankenship:
To go back to the airline accident analogy. You mentioned that one of the errors was that a pilot misheard instructions.

Marcus Blankenship:
And I have to be honest, I'm thinking about all the times every day, probably even as we speak, or as people are listening, that some pilot somewhere is misinterpreting or missing some instructions from the tower.

Marcus Blankenship:
And yet I'm going to guess that doesn't result in a crash. I think I've heard that called… it's something like a latent failure.

Robert Blumen:
This is another term from the field of complex systems. A moment ago, I was telling you how no one fully understands complex systems. One of the consequences of that is that they are always manifesting a subset of partial failures at any time.

Robert Blumen:
Let's consider some kind of an outage where if you had to have five things go wrong all at once, you would have an outage.

Robert Blumen:
Now maybe three of those things have gone wrong. Nobody notices because it's not showing up. You don't monitor those things or they haven't produced an impact on something you do monitor, or you just changed something and it's broken something, but you haven't noticed it yet.

Robert Blumen:
These complex systems are always in a state of being partially broken. You don't necessarily discover that until there's an outage. And then you go back through the post-mortem and you realize you had a failure.

Robert Blumen:
There were five things that went wrong, and three of them had been broken for weeks, and no one noticed. It's fairly common in IT. You hear about an outage where some data was lost, and people found, well, the backup hadn't run for two weeks.

Robert Blumen:
Someone had broken the backup script, and maybe you didn't monitor [inaudible] the backup [inaudible], or maybe the thing that monitors the backup was also broken. That happens very commonly.
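
One way to surface this kind of latent failure is to alert on the age of the last successful backup rather than on the backup job reporting an error. The Python sketch below is a hypothetical check, not something described in the episode: the function name, the 24-hour threshold, and the way the timestamp is obtained are all assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_is_stale(
    last_success: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(hours=24),  # illustrative threshold
) -> bool:
    """Return True when the most recent successful backup is missing or too old."""
    if last_success is None:
        # No record at all is treated as a failure, so that a broken
        # reporting path does not look the same as a healthy backup.
        return True
    return now - last_success > max_age


now = datetime(2021, 4, 27, 12, 0, tzinfo=timezone.utc)
two_weeks_ago = now - timedelta(days=14)
print(backup_is_stale(two_weeks_ago, now))
```

Treating "no data" as stale is a deliberate choice here: it addresses the case where the monitor's input has silently broken, although it cannot help if the check itself stops running.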

Robert Blumen:
So I'm going to go off on a tangent because your question did bring up another point. I was on a tech support call, and in the past I've had tech support read me a certain key or password while I'm trying to type it.

Robert Blumen:
What if I misheard it and typed it in wrong? Nothing terrible would happen, but I wouldn't be able to get access to this resource. On this call, the agent texted me the key and I pasted it into my form.

Robert Blumen:
And that avoided one source of human error. One of the reasons for human error is that the system puts people in a situation where they need to do things, which perhaps a person is not good at.

Robert Blumen:
So human error can result from humans being put in a position to do things which we are error-prone at. And whose fault is that? It's not really the person's fault, and maybe not the air traffic controller's fault.

Robert Blumen:
I was reading something recently about air traffic controllers being required now to wear masks at work, and that they are having more difficulty speaking clearly or being understood.

Robert Blumen:
And I guess you could argue about whether you're mitigating some other risks by wearing masks, but you're putting people in a situation in which accuracy is impaired, and that's not their fault.

Marcus Blankenship:
Well, we've talked a lot about what can go wrong and complex systems and simple systems, but let's turn our attention towards maybe better ways to do things. What's a more useful way of thinking about the causes of failures in a complex system?

Robert Blumen:
Asking why is important. And the overall guiding principle of cause and effect is also important. As I started reading about these accidents, I started making graphs. The key insight here is that the understanding of why something went wrong is not a linked list.

Robert Blumen:
Five dominoes would be a linked list. It's more like a tree structure or an acyclic graph where you have one node at the edge, which is the outage or incident. Then if you step back one layer, you ask, rather than what is the cause, what were the contributing causes to this?

Robert Blumen:
You might find one or two or three or some number. And then from each node in the graph, you would say, what are the causes or multiple contributing factors to this?

Robert Blumen:
And there could be any number two, three, four, five. Some of them may be things you already identified.

Robert Blumen:
And then, rather than putting in a new node, you would draw a line from a node you have to that cause you'd already identified.

Robert Blumen:
You can go back, not necessarily five of those. You could go back three levels or six levels, however many levels, as long as you're still surfacing useful information. And then you'll have this graph and it might have, instead of five nodes, it might have 15 nodes.
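
The graph described here can be held in a very small data structure. The Python sketch below is one possible representation, assuming a plain dictionary that maps each node to its contributing causes. The "unreviewed config change" node is invented for illustration, and the helper name is hypothetical.

```python
from typing import Dict, List

CauseGraph = Dict[str, List[str]]


def add_cause(graph: CauseGraph, effect: str, cause: str) -> None:
    """Record that `cause` contributed to `effect`.

    A cause that is already in the graph is reused, so the call only
    draws a new line to the existing node instead of creating a duplicate.
    """
    graph.setdefault(effect, [])
    graph.setdefault(cause, [])
    if cause not in graph[effect]:
        graph[effect].append(cause)


graph: CauseGraph = {}
add_cause(graph, "outage", "server went down")
add_cause(graph, "outage", "load did not shift to another server")
# Hypothetical deeper factor that feeds both branches.
add_cause(graph, "server went down", "unreviewed config change")
add_cause(graph, "load did not shift to another server", "unreviewed config change")

print(len(graph), "nodes")
for effect, causes in graph.items():
    print(effect, "<-", causes)
```

Because the shared factor appears once with two incoming lines, the result is a graph rather than a strict tree, which is the difference from the row of dominoes.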

Robert Blumen:
Then you can look at it and say, "Well, where are the things that we want to fix?" You are not obligated to fix 15 things if you find there are 15 contributing factors, because maybe only three or four of them are important, or you may not have the money to fix everything, and it may not be cost-effective to fix everything.

Robert Blumen:
But you can take all of the contributing factors that you've identified and rank them, and then decide, which are the most important ones to fix.

Robert Blumen:
You could rank them by some combination of their impact and their costs, and say, "Let's fix the most cost-effective ones." That, I think, is a better way of making systemic improvements to your system that will result in greater stability and avoidance of outages.
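
As a rough illustration of ranking by "some combination of impact and cost," the Python sketch below sorts factors by impact divided by cost. The ratio is only one possible combination, and the factor names and scores are made up for the example rather than drawn from a real post-mortem.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Factor:
    name: str
    impact: float  # estimated benefit of fixing it, in arbitrary units
    cost: float    # estimated cost of fixing it, in the same arbitrary units


def rank_by_cost_effectiveness(factors: List[Factor]) -> List[Factor]:
    """Order factors from most to least impact per unit of cost."""
    for factor in factors:
        if factor.cost <= 0:
            raise ValueError(f"cost must be positive: {factor.name}")
    return sorted(factors, key=lambda f: f.impact / f.cost, reverse=True)


# Illustrative scores only.
factors = [
    Factor("rewrite failover logic", impact=8, cost=4),
    Factor("add backup freshness alert", impact=3, cost=1),
    Factor("replace radar equivalent", impact=5, cost=5),
]
for factor in rank_by_cost_effectiveness(factors):
    print(factor.name, factor.impact / factor.cost)
```

Swapping the sort key, for example to rank by impact alone or by cost alone, gives the alternative orderings that can be compared before choosing what to fix.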

Marcus Blankenship:
I'm imagining your example earlier, when a system went down. You used the example that if we ask why once, someone might say the server went down. And I'm sort of seeing a branching possibility there. On one hand, we could say, "Well, why did the server go down?" And that leads to a whole set of factors. And, as you said earlier, the question "Well, why didn't we have a backup server?" leads to a whole different set of factors.

Marcus Blankenship:
Even in that simple example, I can immediately find two different lines of inquiry to start backing my way to understanding the factors that led to the outage.

Robert Blumen:
I was describing this to a friend and she pointed out, "Oh, so you could either go depth first or you could go breadth first." And since we're programmers, we know how to traverse over trees. And that would give you a couple of different ways to do it.
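
The two traversal orders can be compared on a small example. The Python sketch below assumes a graph stored as a dictionary from each node to its contributing causes. The two first-level nodes echo the server example from the conversation, while the deeper nodes are invented for illustration.

```python
from collections import deque
from typing import Dict, List

CauseGraph = Dict[str, List[str]]


def depth_first(graph: CauseGraph, start: str) -> List[str]:
    """Follow one line of inquiry all the way down before starting the next."""
    seen, order, stack = set(), [], [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Reverse so the first listed cause is explored first.
        stack.extend(reversed(graph.get(node, [])))
    return order


def breadth_first(graph: CauseGraph, start: str) -> List[str]:
    """List every cause at one level before stepping back another level."""
    seen, order, queue = {start}, [start], deque([start])
    while queue:
        node = queue.popleft()
        for cause in graph.get(node, []):
            if cause not in seen:
                seen.add(cause)
                order.append(cause)
                queue.append(cause)
    return order


# Deeper nodes are hypothetical.
example: CauseGraph = {
    "outage": ["server went down", "no backup server took over"],
    "server went down": ["disk filled up"],
    "no backup server took over": ["failover never tested"],
    "disk filled up": [],
    "failover never tested": [],
}
print(depth_first(example, "outage"))
print(breadth_first(example, "outage"))
```

Both functions track visited nodes, so a cause shared by several branches is reported once.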

Marcus Blankenship:
I also really liked that you pointed out that, as you see all these factors, weighting or sorting them by what has the most likelihood to cause a big problem, what's easiest to fix, or possibly what's cheapest to fix, and doing that multiple times, sorting that list differently, will reveal your top three action items, or top N action items, that you could take to prevent this in the future.

Robert Blumen:
Yes.

Marcus Blankenship:
So you've kind of laid out a process for using a different way of thinking this tree model. Do you have any other steps or advice as people want to begin to stop thinking about their system in terms of simple systems and start thinking about them in terms of complex systems?

Robert Blumen:
One of the things that's in a lot of the literature about this new view of human error is, in dealing with people, to have an awareness that you may have a cognitive bias toward focusing on the portions of the system where a person made a decision.

Robert Blumen:
Given the idea that human error is kind of a label for the part of the system where the person did not work, or was a contributing cause, you can then ask: what was the context that that person was facing at the time?

Robert Blumen:
Did they have enough information to make a good decision? Are we putting people in impossible situations where they don't have the right information in front of them?

Robert Blumen:
Was there adequate monitoring? Was there a runbook, if this was a known problem? And what are ways to improve the human environment so that the operator can make better decisions if the same set of factors occurs again?

Marcus Blankenship:
That's great. Robert, thank you so much for being on the show today.

Robert Blumen:
Thank you, Marcus.

Speaker 1:
Thanks for joining us for this episode of the Code[ish] podcast. Code[ish] is produced by Heroku, the easiest way to deploy, manage, and scale your applications in the cloud.

Speaker 1:
If you'd like to learn more about Code[ish] or any of Heroku's podcasts, please visit heroku.com/podcasts.

About Code[ish]

A podcast brought to you by the developer advocate team at Heroku, exploring code, technology, tools, tips, and the life of the developer.

Hosted By:

Marcus Blankenship

Sr. Engineering Manager, Heroku
@justzeros

with Guest:

Robert Blumen

Lead DevOps Engineer, Salesforce
@robertblumen