103. Chaos Engineering

Deeply Technical
December 17th, 2020
Episode 103
28:17

Hosted by Rick Newman, with guest Mikolaj Pawlikowski

Rick Newman interviews Mikolaj Pawlikowski, who recently wrote a book called “Chaos Engineering: Crash test your applications.” The idea behind chaos engineering is to “break things on purpose” in your operational flow. You deliberately inject the kinds of failures that might occur in production ahead of time, so that you can anticipate them and implement workarounds and corrections. The practice is most often used for large, distributed systems, because they have many points of failure, but it can be useful in any architecture.

One of the obstacles to embracing chaos engineering is getting buy-in from teammates, or even managers. Even after a feature is complete and the unit tests are passing, it can be difficult to convince someone that resiliency work needs to continue, because there’s no visible or tangible benefit to preparing for a disaster. Mikolaj suggests clearly laying out for colleagues the ways a system can fail, and the impact each failure can have on the application or business. Rather than fearmongering, it can be useful to point to other companies’ availability issues as cautionary tales. Mikolaj also says that chaos engineering doesn’t need to focus solely on complicated problems like race conditions across distributed systems. Often there’s enough low-hanging fruit worth fixing, such as disk space running out or an API timing out.

The chaos engineering mindset can also extend beyond pure software. If you think of people working across different timezones as a distributed system, you can also prepare for failures in communication before they occur. Everyone works at a different pace, and communication issues can be analogous to network loss. Rather than fixing miscommunications after they occur, establishing shared practices (like writing up every meeting, or setting up playbooks) can go a long way toward ensuring that everyone can do their best under changing circumstances.

Links from this episode

Mikolaj’s book is called Chaos Engineering: Crash test your applications — get a 40% discount using the code podish19
PowerfulSeal is a testing tool for Kubernetes clusters
Mikolaj distributes the Chaos Engineering Newsletter
Conf42 is a conference focusing on high-level computer science
ChaosConf is the world’s largest Chaos Engineering event
Awesome Chaos Engineering is a curated list of Chaos Engineering resources

Show Notes

Rick:
Hello, and welcome to the Heroku Code[ish] podcast. I'm your host today, Rick Newman, and I am here today with Mikolaj Pawlikowski, who has an upcoming book, Chaos Engineering: Site Reliability Through Controlled Disruption. Miko, thank you so much for joining us. And I wonder if you could just talk a little bit about yourself and a little bit about your upcoming book.

Mikolaj:
Sure. I'm really happy to be here. Thanks for hosting me. Like I said, I just finished my book. It's called Chaos Engineering: Crash Test Your Applications. I think they're going to change the title before it goes to print, but that's the temporary title for now. For those of you who have never heard of chaos engineering, you might have heard of things like Chaos Monkey. And probably if you Google the term, you're going to come up with slogans like "breaking things on purpose" and stuff like that. But I guess chaos engineering is just the practice of experimenting on a system, and that system can be anything. It can be big, it can be massive, it can be tiny. You typically hear about the big ones because in distributed systems there's just more stuff that can go wrong. Experimenting to increase the likelihood of things recovering the way that you want them to recover, and uncovering the things that don't recover the way you want them to, is basically what we do with chaos engineering.

Mikolaj:
So it's the deliberate practice of injecting the kind of failure that the real world is likely to inject into your system, to verify that your assumptions are correct. It's a really fine discipline, a lot of fun.

Rick:
That sounds like a lot of fun. Who doesn't like interrupting normal flow with a little bit of chaos now and then? I know that certain engineering organizations have teams set up specifically for this purpose, or maybe for broader and larger tests. And at many, probably most, there are little pockets of individuals or practices that are going through this and using it to validate existing systems. Are there particular roles that are really prominent in chaos engineering? Is it SRE, is it DevOps, or is it more of a holistic engineer who's doing feature work as well as testing and validation?

Mikolaj:
That's a good question. I think it's kind of an ongoing topic. A lot of SRE teams and DevOps teams adopt it as part of their routine because it's extremely handy. If you're an SRE, then you care about reliability, and all the tools that help you achieve that are your friends. So chaos engineering can be a really powerful tool to detect issues, because with a pretty small investment of time, if you know how your system is supposed to work, you can verify that it actually works in failure scenarios the way that you expect it to. So, yeah, I'm seeing more and more people in SRE-type roles adopt it, just because it gives them so much benefit.

Rick:
Right. So it certainly seems well suited to SRE, DevOps, or operational roles. But because, as you've mentioned, of the need to span systems, it seems like this would also be applicable to engineers in general, not necessarily in those specific kinds of roles or at a particular level. It would really be applicable to anyone. Is that right?

Mikolaj:
Exactly. Yeah. And that's also part of what actually makes it so much fun, because you're not working on just one domain. When you're doing this chaos engineering kind of thing, you cut across different stacks and different technologies and different disciplines, languages, whatever, and you have to know enough about them to actually do that. So you learn a lot and, like you said, it can be applied in different teams and at different levels.

Rick:
And where does this live? We might have some kind of hooks on check-in for some kind of validation maybe unit tests and integration tests and maybe somebody even hacking away at the interface somewhat randomly. Where do you see chaos engineering come in and what's the value out there?

Mikolaj:
Sure. So typically that's the biggest hurdle you have to get over when you need to generate buy-in for doing chaos engineering, because it does sound kind of counterproductive at the beginning. So you do your unit tests, and you might achieve your hundred percent test coverage and put your badge on and be quite content with that. With integration tests you go a level higher, right? You integrate a bigger chunk of code with whatever external systems and dependent systems it's interacting with. And chaos engineering, the way I see it, is a level higher than that. You basically take the systems that run as a whole. Most people, when they hear about chaos engineering, think of Netflix, because they made it popular with Chaos Monkey and stuff like that. Right?

Rick:
Right.

Mikolaj:
So big distributed systems, where you have a lot of moving parts and the interactions between those moving parts, are interesting for chaos engineering, because you can inject a little bit of randomness. You can inject a little bit of failure here and there and verify what kind of compounding effect might be spreading through your system. So in my view it's an extra layer on top of the unit tests and integration tests, where you take the system as a whole and you actually work on a live system, unlike the small piece of code you exercise in unit tests or integration tests. And obviously it goes back to the discussion of whether you should be doing that in production or not, but the big difference is that you take this live system, even if it's not production, that has real traffic, and you actually experiment on the thing as a whole.

Mikolaj:
A lot of the time you discover things that are not working the way that you expect them to, and failures that don't recover the way that you wanted them to. And a lot of this is very simple stuff. Just finding that has a lot of value, and there's obviously the aspect of finding it sooner rather than later, so that you can react quicker rather than running for a longer time with some kind of bug or some kind of unusual interaction between components that might produce results you didn't expect. And also, from the perspective of the people who actually do the work of fixing these things, the people who get called at night when they break, it's really nice to find this kind of issue during working hours, right? You do an experiment when you're in the office and you have plenty of time left in the day to do something about it, rather than at 2:00 AM, when it takes you like half an hour to even wake up and actually understand what's going on.

Rick:
Right. And I can relate, having been on call for a good decade in the first part of my career, as many have been and still are. Nothing tests a playbook or a response like trying to look through and figure out an issue 15 minutes after waking up, when it's before dawn and you're still bleary-eyed. So I can see that that would be a pretty big draw for engineers who are on call and having to deal with regular and frequent off-hours calls.

Mikolaj:
Yeah. I mean, this argument typically aligns really well with anyone who has been on an actual on-call rotation once in their life. For me, that's pretty much how we started doing chaos engineering. For context, I started on this project where we were working on getting our stuff integrated into Kubernetes. That's like a microservices platform, and I think that was 2016 and version 1.2 or something of Kubernetes, so it was pretty cutting edge and pretty unstable. And the fixes were just flowing in every day, quicker than we could actually look at them.

Mikolaj:
So it was just to not get called too much at night, and to have this feeling that when something happens, you will be able to react quickly. That really was the driving factor. Just getting called less at night was enough to get us into this game.

Rick:
Anything that lets me sleep through the night gets my vote. In your research and in talking to other individuals and other organizations, I'm sure you heard some great real-world stories about failures like this, maybe on call, in complex and distributed systems, where an injection of systemic disruption is really an example of chaos engineering. Do you have any of those stories that you could share?

Mikolaj:
Yeah. I mean, this is typically an interesting aspect of this chaos engineering thing, because a lot of practitioners have things that they resolved, but once they resolve them, they don't necessarily want to talk about what the thing was. In my experience, a lot of these things are not sophisticated. There's a lot of low-hanging fruit. Typically the things that get to us are startup conditions that didn't exactly align across some set of microservices, which got out of sync because of that when they were starting, or stuff like that. If you actually had the idea to go and check, you could probably just stare at it and figure it out, but for some reason or another it slipped through the review process and someone didn't think about some kind of race condition across the different components.

Mikolaj:
So before you even get into the complex stuff, where you're resolving some kind of sophisticated, difficult race condition across a distributed system, you can harvest so much of the low-hanging fruit, where just small things add up to a potential disaster. I tend to get that question a lot. And when I was writing the book, my publisher was like, "Hey, can we get some people to actually talk about the failures that they fixed this way?" And it's a little tricky to get people on record on that.

Rick:
Okay. That's fair. I don't think I would necessarily want to share and publish the details of some of my past failures either. So I can get that and see where they would be coming from there. A lot of what we've talked about has been large and complex, more modern distributed systems and certainly where microservices and lots and lots of different options for hosting and processing and computing exists. I mean, is it only for those large and complex systems? Is there room for chaos engineering elsewhere?

Mikolaj:
I think historically this entire thing started because of the increasing complexity of these large distributed systems with a lot of microservices, where people moved from a bad, bad monolith to good, good microservices, and all of a sudden they had all kinds of issues with connectivity and retries and backoffs and thundering herd problems and all of that. So I think there's definitely a lot you can harvest when you have a distributed system. And with Kubernetes basically taking the world by storm, everybody wants to run on it now, so I'm guessing we'll be seeing more and more of that. But in my book, I really tried to make what I believe to be a strong case that it really doesn't stop there. The same kind of principles and the same kind of mindset can be applied to pretty much anything. I'm going so far down that path in the book that I'm trying to demonstrate that this works even if you have a legacy binary that you don't really understand, but you know something about how it's supposed to handle retries. In one of the chapters, we go through an example where you basically block some proportion of the syscalls. I think it's writes or something like that.

Mikolaj:
And with something like that, you don't even need to see the code or understand the code base. You can just call it legacy and treat it as a black box, and you can still get value, because injecting that failure lets you actually verify that, okay, if this is failing to write to the network or whatever it was, you see how it actually behaves, rather than reading the code and understanding, okay, this is how it's supposed to be working. You actually verify it like an experiment. You put your scientist's white coat on, rubber gloves on, game on, and you actually verify that it's doing what it's supposed to be doing. So one of the things that I'm really trying to change about the landscape of chaos engineering right now, if you're just searching on the internet, is the idea that it has to be massive distributed systems. It really doesn't.
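
The black-box experiment described here can be approximated in a few lines of Python. This is only an in-process sketch, not the syscall-level technique from the book: `FlakyWriter`, `write_with_retries`, `run_write_experiment`, and the every-other-call failure pattern are hypothetical names and choices made for illustration. It fails a fixed proportion of writes and checks that the retry logic still delivers every message.

```python
class FlakyWriter:
    """Wraps a write function and fails every `fail_every`-th call.

    Hypothetical stand-in for blocking a proportion of write syscalls.
    """

    def __init__(self, write, fail_every):
        self._write = write
        self._fail_every = fail_every
        self.calls = 0
        self.failures = 0

    def __call__(self, data):
        self.calls += 1
        if self.calls % self._fail_every == 0:
            self.failures += 1
            raise OSError("injected write failure")
        return self._write(data)


def write_with_retries(write, data, attempts=3):
    """The behavior under test: retry a failed write up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return write(data)
        except OSError:
            if attempt == attempts:
                raise  # out of retries: surface the failure


def run_write_experiment(messages, fail_every=2):
    """Inject failures, send every message, and return what arrived."""
    delivered = []
    flaky = FlakyWriter(delivered.append, fail_every)
    for message in messages:
        write_with_retries(flaky, message)
    return delivered, flaky.failures
```

Comparing the delivered list with the input shows whether the retry behavior holds up under the injected failures, without reading the code being tested.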

Mikolaj:
You can get a lot of value by doing reasonably simple things in much smaller systems. Another example is the JVM. One of the cool things that you can do, and it's been part of the JVM forever, is rewrite classes, so you can inject things. You can inject bytecode at runtime and do something as simple as trigger some kind of exception that you're expecting to see, and verify that all the dependencies that are built into that, and all those thousands and thousands of lines of "enterprise ready" code, actually do what you expect them to do in the face of some kind of simple failure. And very often you find that it almost does.

Rick:
Right. I see what you're saying. That's great. So it's not just these larger systems, but even smaller systems that we interact with every day, where there's a lot to dig into. And with browser debuggers or anything even more local, we can apply these principles to really any software application that we interact with.

Mikolaj:
Yeah, precisely. Or it goes further. If you log into your Gmail or whatever you're using, there's going to be plenty of JavaScript running there. Why not just go and check what happens when that JavaScript fails to fetch some data from the internet? Is it actually going to retry, or alert you, or tell you in the right way, or is it going to sneak in some stale data? And maybe when you click save, it's actually going to overwrite your stuff with some stale data that it still has in the store. And there are tools for that. It takes like five minutes to identify what kind of calls it's making and then start playing with that, and from there you can detect things. You will be surprised how much stuff you find.
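
The same question can be asked of any client that caches data. The sketch below is a hypothetical Python model rather than browser tooling: `CachingClient`, `load`, `save`, and the dictionary used as a server-side store are all invented for illustration. It injects a fetch failure and checks that the client marks its copy as stale and refuses to overwrite newer data.

```python
class CachingClient:
    """Toy client that keeps a local copy of remote data (hypothetical)."""

    def __init__(self, fetch):
        self._fetch = fetch
        self.data = None
        self.stale = False

    def load(self):
        try:
            self.data = self._fetch()
            self.stale = False
        except ConnectionError:
            # Fall back to the cached copy, but remember it may be outdated.
            self.stale = True
        return self.data

    def save(self, store):
        if self.stale:
            raise RuntimeError("local copy may be stale; refusing to overwrite")
        store["value"] = self.data


def failing_fetch():
    """Injected failure: every fetch behaves as if the network is down."""
    raise ConnectionError("injected fetch failure")


def run_stale_data_experiment():
    """Cache a value, break the network, then try to save over newer data."""
    store = {"value": "v1"}
    client = CachingClient(lambda: store["value"])
    client.load()                  # healthy fetch caches "v1"
    store["value"] = "v2"          # someone else updates the server copy
    client._fetch = failing_fetch  # inject the failure
    client.load()                  # falls back to the cached "v1"
    try:
        client.save(store)
    except RuntimeError:
        pass                       # expected: the client refused to save
    return store["value"], client.stale
```

A client that silently saved its cached copy would leave "v1" in the store, which is the kind of data loss the experiment is meant to surface.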

Rick:
Great. So larger systems and smaller even local apps or black box testing, maybe things that we don't necessarily have access to. So it applies across a number of different domains or different applications. Is that everything, is that it, is it applicable to anything else? Is that where it stops?

Mikolaj:
Well, you're probably asking the wrong person. I just finished a book where I try to give examples of everywhere we can apply that. One of my favorite examples, and I need to give credit where it's due, is an idea I got from Dave Rensin at Google, from one of his presentations a couple of years ago. When you think about teams, they are just the same kind of distributed systems as the machines that you're working with. You have individuals who have their CPUs in their heads, running at different speeds, and they have to communicate.

Mikolaj:
They miscommunicate. They have lost acknowledgements across the network. And you can apply a lot of these things in pretty much the same way. You inject a little bit of failure here. For example, you tell this person to not acknowledge anything, which would be evil. Or you tell them to say the wrong thing on a particular day and see how that information propagates and what happens to the system as a whole, and whether the systems in place actually catch that, or whether people on the team are like, "Yeah, I'll just do that," and you erase the whole thing. So obviously you have to be careful with it, but when you get to that level of maturity, and when you can do that with a team that actually enjoys it and sees value in the kind of games that you effectively play, it's super, super fun, because it just makes sense.

Rick:
Yeah. I can see the need for high trust, a mature organization to be able to try something like that. Being able to have individuals or groups that you really know that know each other well to be able to test that out. That's a great idea. I think that'd be a fun thing to try out, probably something I'm going to have to ask my teams about and see what they would think about.

Mikolaj:
Definitely yeah.

Rick:
You mentioned even using a web debugger and being able to test local apps, and I guess that'd be kind of some gray box testing and using tools like that. What other tools are there that one could use to kind of start this practice?

Mikolaj:
One of the things that I try to do with the book, again, is to not rely on any one particular tool too much, because a lot of these things you can do fairly easily with existing software. Plenty of these tools have basically been available forever. A lot of the work of chaos engineering is about observability, meaning actually knowing how to reliably observe some characteristic, some quantity, of the system. If you're running an experiment and you can't properly measure whether it worked or not, then it's no good, right? It probably does more harm than anything else. So I'm focusing a lot on Linux in the book, and a lot of these tools have been there for a long, long time. And a lot of people are fairly familiar with them.

Mikolaj:
There are probably some less popular flags and ways of using them, but it's pretty standard stuff. There are obviously new tools coming out, and they cover the more advanced topics. I'm obviously biased because I wrote one called PowerfulSeal, if you search on GitHub. That one is really focused on Kubernetes and on working at a high level: this is what I would like to inject into the system, and please validate it this way. You can just write a simple YAML file and get it to do that for you automatically. But you don't need to go that way. There are also commercial alternatives coming up, like Gremlin, that make it really easy to start. But at the heart of it, it's all about actually understanding why you're injecting that failure to begin with, because a lot of people think that it's just about randomly breaking some stuff and seeing what happens.

Mikolaj:
And that is part of it. It's close to a technique called fuzzing, where you basically randomly do things and see what happens, and this way you can find a lot of stuff. It's definitely useful. And you can do the same thing with chaos engineering. If you don't know where to start, starting with random is as good as any other start, probably better in some cases. But the actual useful work that you do is when you understand the system pretty well, or at least you understand how it's supposed to work in theory, or some subset of the system that you're interested in, because then you can reason about the different behaviors and the characteristics of the system that you expect to be there. And based on that, you can come up with experiments to verify that, yeah, okay, this characteristic actually is there and it doesn't change under some kind of failure scenario. You see what I mean, right?
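
A hypothesis-driven experiment of this kind has a simple shape: confirm the steady state, inject a failure, check that the expected characteristic still holds, then clean up. The following is a minimal Python sketch under assumed names (`run_experiment` and the toy `replicas` system are hypothetical); it is not how any particular tool structures its experiments.

```python
def run_experiment(steady_state, inject, rollback):
    """Return True if the steady state survives the injected failure."""
    if not steady_state():
        raise RuntimeError("system unhealthy before the experiment; aborting")
    inject()
    try:
        return steady_state()
    finally:
        rollback()  # always undo the injected failure


# Toy system: a service that is healthy while at least one replica is up.
replicas = {"a": True, "b": True}


def service_responds():
    return any(replicas.values())


def kill_replica_a():
    replicas["a"] = False


def restore_replica_a():
    replicas["a"] = True


# Hypothesis: losing one replica does not take the service down.
hypothesis_holds = run_experiment(service_responds, kill_replica_a, restore_replica_a)
```

The steady-state check is the observability piece: if it cannot be measured reliably, the experiment cannot say whether the hypothesis held.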

Rick:
Yes, I do. Fuzzing is a new term for me, but going beyond that sort of random testing is where the real value of chaos engineering comes in, then. Using your own understanding of the system and where the edges are, or where the failures could present themselves, is really what this is all about. And maybe in the former case, where you don't know a lot about it, random testing can be useful to gain some of that info.

Mikolaj:
Yeah. I mean, it can be a good start. If you don't know anything, if it's a black box to you, you might as well poke it and see what happens. But getting to the nitty-gritty actually requires understanding. That just makes sense.

Rick:
Right. So get to know your systems and where those edges are. So obviously your book is an excellent resource and I'm a bit old fashioned in that I still love books. Are there other books or resources that someone could use to get more familiar with this concept? Where would you go to learn more?

Mikolaj:
Yeah. So there are at least two existing O'Reilly books that you can take a look at. They cover much more of the theory and motivation: why would you do that. And the recent one actually has some pretty decent chapters written by people who applied chaos engineering at their work, and that's definitely interesting. Otherwise, a good resource is the Awesome Chaos Engineering list on GitHub. If you Google that, it will come up. As far as I know, it's probably the best maintained resource in terms of just a list of links. It has tools, some blog posts, conferences, and stuff like that, so I definitely recommend that one. I also have my own newsletter at chaosengineering.news. You can sign up and get an occasional email with some news. I would say that's probably a good start. There are also two conferences that you might be interested in. Gremlin has ChaosConf once a year, and I also run Conf42: Chaos Engineering, which is actually in Europe.

Rick:
Okay. And for those interested in getting started with their local teams, once they've read your book, obviously, and maybe started to explore some of those other resources, are there things they could do to get started and even to share some of their learnings with their local teams?

Mikolaj:
Yeah. That's actually a question that comes up pretty often, to be honest, and people tend to have a few approaches. Either they go for kind of a learning session, like a book club, where they just prepare a demo and everybody's like, "Oh, wow. I didn't think of that. That's cool. Let's do something like that." Or a popular approach is that something actually breaks, and then you can point your finger and say, "Oh, look, if only we did this little experiment that takes about two minutes to set up, we wouldn't have had that problem and you wouldn't have been called. What do you think we should do?" So that's a very powerful position. And I think a lot of places actually look at things like that in retrospectives, thinking, okay, how do we prevent things like that from happening again? You really don't need a lot of tools. You can start with a two-line bash script. A common gotcha: if you're running a systemd service and you have restart set to always, the always doesn't really mean always. With the other parameters on default, it means that if it restarts, I think it's more than five times within a ten-second moving window, then it won't actually be restarted.

Mikolaj:
So for those who've run into stuff like that, all you need to do to fix it is change one other parameter. But if you're just reading that, it's very easy to say, "Oh, that actually looks like always, so it should always be restarted." So if you take a bash script that has one line and executes the kill five times in a ten-second window, you can find a problem like that. And if you have a technical lead who has a beard about half a meter long and you show that to them, you can cause some real trouble in the team with that.
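
The kill-it-repeatedly experiment can be sketched as follows, in Python rather than bash. `kill_experiment` and its `run` and `sleep` parameters are hypothetical names; the sketch assumes a systemd host, a unit configured to always restart, and enough privileges to run `systemctl kill`. The five kills and ten-second window are the figures quoted above. Because the limit is described as applying to more than five restarts, whether five kills are enough may depend on how recently the unit was started, so the count is a parameter.

```python
import subprocess
import time


def kill_experiment(unit, kills=5, window_s=10.0,
                    run=subprocess.run, sleep=time.sleep):
    """Kill `unit` `kills` times inside `window_s` seconds.

    Returns True if systemd still reports the unit as active afterwards,
    False if it has stopped restarting it.
    """
    pause = window_s / (kills + 1)  # keep every kill inside the window
    for _ in range(kills):
        run(["systemctl", "kill", unit], check=False)
        sleep(pause)  # give systemd a moment to restart the unit
    # `systemctl is-active --quiet` exits with 0 only when the unit is active.
    result = run(["systemctl", "is-active", "--quiet", unit], check=False)
    return result.returncode == 0
```

On a test machine, `kill_experiment("myapp.service")` (the unit name is a placeholder) returning False would indicate that the restart policy gave up. The `run` and `sleep` parameters exist so the function can be exercised without touching a real service.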

Rick:
Where would we even be without those individuals, bearded or not? Right. Miko, this has been a fascinating and educational chat, certainly for me and I'm sure for others. Thanks so much for taking some of your time to be on this episode of Code[ish] and to share a bit about chaos engineering and all the different contexts and ways that it can be useful, teaching us more about our systems and even our teams, apparently. And for those listening, we'll put it in our notes, but you can use the code podish19 at the Manning site for a discount on Miko's new book.

Mikolaj:
Thanks a lot.

About Code[ish]

A podcast brought to you by the developer advocate team at Heroku, exploring code, technology, tools, tips, and the life of the developer.

Hosted By:

Rick Newman

Director of Engineering, Heroku

with Guest:

Mikolaj Pawlikowski

SRE Engineering Lead, Bloomberg
@mikopawlikowski