I Was There: Stories of Production Incidents

Tools and Tips
November 12th, 2020
31:15

Also listen via

I Was There: Stories of Production Incidents

Hosted by Corey Martin, Meg Viar, Brendan Hennessy

Corey Martin values storytelling. It's just one way developers can share their experiences in order for others to take lessons. To that end, this episode takes a close look at production issues from two different applications to examine what went wrong and how it was fixed.

Meg Viar is a Senior Software Developer at Nomadic Learning, an e-learning platform. One day, they noticed that, for a certain group of users, a column of information in their database row was nulled. It didn't look like any user–either internally or externally–intentionally changed these values, and there hadn't been any new code deployed in days. The only clue was that the data was all changed at the same time. It turned out that a weekly cron job was deleting some data on an in-memory list. However, the database ORM they use also overloads the delete keyword, and was actually deleting the production data. Restoring the data from a backup was easy, and reworking the code to not use the data was a quick fix. However, going forward, Meg and her team came up with several ways to adjust the process around code changes like this from occurring again.

Brendan Hennessy is the co-founder and CTO at Launchpad Lab, a studio that builds custom web and mobile applications. One of their clients is an SAT/ACT test prep app, and students complained that the app was extraordinarily slow. Brendan was accustomed to seeing such feedback on testing days, when heavy volume brought added strain to servers, and they accounted for this by increasing capacity. But this was different: there weren't any tests scheduled during the period. Instead, one of their own services was inadvertently DDOSing an endpoint, expecting a response; when one didn't arrive, it just kept making requests. They reworked this code to make a request once and simply wait for a response without trying again. In the future, they committed themselves to doing more in-person blitzes of new features, since issues like this only arise after multiple users use the app–something automated tests have trouble simulating.

Links from this episode

Nomadic Learning builds digital academies
Launchpad Lab builds custom web and mobile applications for startups and established businesses

Show Notes

Corey:
I'm Corey Martin. Today, we're talking about production incidents, those scary times that your app doesn't work as you expect. We're bringing these stories out into the open so we can learn from them. In this episode, we'll talk to two developers about production incidents they were personally involved in. First up is Meg Viar, a senior software engineer at Nomadic Learning. Meg, welcome to Code[ish].

Meg:
Hi.

Corey:
So tell us a little about Nomadic Learning.

Meg:
Yeah. So Nomadic is a fully remote company. Our team comes from all around the world, from New York to California to Milan, and our big focus is we build digital academies. So we're trying to help individuals and organizations to adapt to different situations through an approach where we use continuous and collaborative learning. Obviously we are in really uncertain times right now, and a big part of what Nomadic does is help people to adapt to uncertain times. So it has been a very busy year. We've been creating a lot of new content, new programs for consumer based academies, as well as products for the different enterprises that we work with.
<!– more –>
Corey:
I imagine as more and more has moved online, you've seen an uptake in usage and it might increase the pressure you already have to keep everything online and working.

Meg:
Yeah, absolutely. We've definitely seen an uptick, not only in inquiries about new business, but also just in our user shift and like you said, right now things are all digital, so people are really relying on these digital pieces of their learning infrastructure to be online, be operational and be available, because there isn't really an alternative right now in person.

Corey:
What app are we talking about today?

Meg:
Yeah. So we're talking about the main platform app at Nomadic. So we have a small ecosystem that supports the different academies that we have. The app that we're going to talk about is one that powers the main learning platform for our end users. So that's a place where users are able to go to engage with our programs, as well as our more bite sized resource content and all the different interactive and collaborative elements that we have there.

Corey:
And what stack are you using?

Meg:
So for this app, our front end is in Vue.js and on the backend we're using Python and the Pyramid framework. And we also use MySQL and SQLAlchemy and SQLAlchemy is an ORM or object relational mapping, so that's basically just a way to communicate to the database in a way that's a little more human readable I suppose.

Corey:
And what happened?

Meg:
Yeah. So we found that for a certain group of users when they went into the app, they were basically seeing the app empty of any relevant data for them, which is obviously a problem. I mean, it was only affecting a group of users for a certain client. So we had our member of the customer success team relay this to our head of technology and initially there was a concern that maybe extensive data had been lost for this group. But eventually we were able to figure out that what had happened was for our user records, users in this certain group, one field in all of their records had been nullified. And this was a foreign key, which just means an ID basically that points from one record and represents a relationship to another record and there really was nothing that seemed like it could be responsible for this.

Meg:
There were no requests during that time that, from what we can understand would touch those records in that way and certainly not delete a foreign key and we hadn't made any deployments within the past few days and the most recent couple of deployments didn't seem like they would be related at all. But clearly this was something we wanted to head off in case it could happen again. So at the time, nothing stood out to me. I actually took a break, went and had lunch, I was like, "I need to step away from this because I don't understand how this could happen." I tried to start from scratch because I had really been banging my head against the wall a little bit trying to think how that can happen. So went back and started with what my assumptions were and what the facts were.

Meg:
So I knew that all of the foreign keys of a specific value had been nullified. To me that implied that some code had touched all of the user records in that group. From what we could tell there wasn't any sort of external attack or way that someone outside of the system, either an internal user or some external user had accessed the data, so this was something that was actually happening programmatically. So my question for myself was what code touches all of these? And so, I went back in and I reread our recap of what had happened and I noticed the time, it appeared that it had happened around 9:40 and I was starting to think, "Okay, there's a cron that runs 40 minutes after the hour." This is a classic case of being both the detective and the culprit, because this was a piece of code that I had touched not super recently, but maybe a week before the event had happened.

Meg:
So I went into that piece of code and when I realized the time and the connection, I thought, "Okay, I know that this does touch entire groups of users based on their edition." So basically we had made an update where we were sending out these emails and we had to break down the batch size of the emails we were sending because the messages started to get too long. So I had a little function that was batching up the users to send the emails to, and within that function, I was using Python's native delete to delete items from a list that was basically this is my list of users that I want to send this email to. As I sent them off to a batch, I was cutting down the list until the list was empty …

Corey:
Ah, okay.

Meg:
… which is fine. It's fine it's just an in memory list of dictionaries that represent user objects, but it turns out there was a little something more going on. So the ORM that we use, SQLAlchemy and as I mentioned earlier, that's just something that lets you communicate with the database rather than through raw SQL statements, you can write it with SQLAlchemy in a bit more Pythonish way. And that ORM actually alters the native delete keyword by putting on, I'm assuming some sort of listener that when you delete something that represents a relationship between records, it actually updates that data in the database without you explicitly doing so in the database session.

Corey:
That is a sneaky issue.

Meg:
It seemed sneaky to me, but in hindsight there are definitely ways that I could have avoided this.

Corey:
Let's talk about that then. So I imagine you fixed it within the code then and restored the data that had been altered?

Meg:
Yep. So the actual restoration was pretty trivial, because it only deleted all instances of one value of that foreign key. So basically we were able to say, "Anywhere where this foreign key is nullified, we can just put that value right back in." And then we were able to make that update to the code fairly quickly. Literally all we had to do was just cast that list, that instrumented list into a regular list that doesn't represent the relationships.

Corey:
Got it. So ultimately a code fix to prevent this from happening again. You also mentioned that you learned a few lessons from this that you apply on the team. Would you go into those a little bit?

Meg:
Yeah, absolutely. One thing for me that was a big lesson is that while an ORM can add some convenience, in our case, it does actually alter some of the native Python data structures and keywords. For me that led into a desire to better understand exactly what our ORM is doing and just because things look simple, to question your assumptions. Another lesson that we learned was that automated tests definitely won't catch everything. We actually sat down as a team and went through all the tests that we reasonably could have written or would have thought to write without knowing that this was going to be an issue and we couldn't really come up with any that would've caught this. Really, it would have just required running this code locally and then going into the app and just trying to use it as a user in that group.

Meg:
That was an issue there, knowing that maybe we needed to do a little more manual testing and not rely on automated testing. And another big one was making sure that every change that goes out, even one that seems really small, this seemed like a really small update, we were just going from sending everything at once to sending it in batches of a few hundred, but that little change actually had a big risk associated with it. So treating every change as you would perhaps releasing a major feature. And the last big one for me and something that I think we're maybe going to try to move to do more as a team is to actually have some element of a live code review.

Meg:
We tend to do our code reviews totally asynchronously. But I do think that in this case if we had sat down, this code was reviewed, the reviewer definitely knew that that was a potential issue, he has a really deep understanding of how the ORM works and he had even gone back and looked to make sure that they were simple lists, but one branch of the conditional had just slipped by. But I think that if we had actually sat down together, even if it was just 10 or 15 minutes and I had walked through the different branches of this change, I think we would've caught it.

Corey:
What advice would you have for other developers based on this experience?

Meg:
Yeah. So one thing that worked really well for us, I've actually recently been trying to read and learn more about DevOps. And so, something I've been reading about is blameless postmortems. This isn't something that we formally have been using as a team, but we essentially had a blameless postmortem after this happened, where basically we sat down and were like, "What happened? How did this happen? How can we make sure this wouldn't happen again?" And like I was saying about going through tests we could have written, whether or not that would have caught it, all those sorts of discussions came up.

Meg:
So I think a blameless postmortem is really important. It's easy to want to cast blame, I definitely got myself pretty worked up over this and was just beating myself up for letting something like that slide past. As people we like to blame, blame is a part of our nature and so, I have heard people say, "Having a blameless postmortem is probably impossible, but having a blame aware postmortem where you make yourself aware of the fact that that's a natural tendency and actively try to work against that," because ultimately in a complex system, mistakes do happen and figuring out how to learn from it is really the only way to approach it.

Corey:
Absolutely.

Meg:
So another thing that I learned from this was to make manual testing of new code more of a priority. We thought of all the ways an automated test might have copy issue. It was unlikely that it would have, but if I had done more extensive testing locally, I think it's something that would have revealed itself. And then the big one, I think that's my personal takeaway is that if there's something in your stack that does things automagically, where things are happening behind the scenes and you have very simple syntax that you can interact with and it just happens, make sure you understand how that magic actually does happen. So part of this I think is taking time to really understand the philosophy and the overall approach of your tech. That seems like something that's obvious but for me, I tend to like to jump into things and start working on them right away.

Meg:
Like when I came on at Nomadic, I went through the documentation, the original tutorial and things like that on our ORM before I ever started. And then I got to the job, hopped in and started working with it and never really returned to those fundamental pages in the docs. I would usually go into a problem, look up what I needed to look up to complete the task at hand and never really went back for more deep dives. That just in time learning can be really beneficial when you want to just hop in and get things done quickly. But I do think it has major drawbacks if you're not building up a foundation where what you don't know could definitely hurt you. So just in time learning has its place, but you also need to give yourself time and encourage your teammates as well to feel like they have the autonomy to take time to go back and review documentation even if you're familiar with working with it, just to continue to grow and deepen your understanding of how it works.

Corey:
You've really highlighted here the dependencies our applications have, the libraries that they use and sometimes the unexpected consequences of that and how educating yourself on what your libraries are actually doing and how they work might help avoid issues in the future. I think that's really, really good advice.

Meg:
It's definitely a lesson learned and it's good encouragement to take some time to learn a little more deeply. And I'm hoping that from this I can take some time to even read and do some writing about not only what happened, but also just helping others understand a little better how tools like SQLAlchemy work.

Corey:
Well, Meg, I've learned a lot. I really appreciate your candor in sharing the story, sharing what happened, how you discovered it, how you fixed it and the lessons that you took away. Thank you so much for being on Code[ish].

Meg:
Thanks Corey.

Corey:
The next incident we'll talk about is very different from Meg's, but some of the lessons are the same. Here's Brendan Hennessey, co-founder and CTO at Launchpad Lab. Brendan, welcome to Code[ish].

Brendan:
Thanks for having me.

Corey:
So tell me a little bit about Launchpad Lab.

Brendan:
Sure. We are a web and mobile development studio based out of Chicago, Illinois. We've been in business for about eight years now and build a custom web and mobile applications for startups to more established businesses looking to roll out a new product or a new revenue line is typically where we jump in to help.

Corey:
And what app are we talking about today?

Brendan:
We're going to be talking about an application that we built for a client of ours. They work in the SAT, ACT prep market, so to speak, and they have an online application to simulate the test taking experience for one of those standardized tests. We built the online tool to actually administer and take the test as well as some backend functionality to manage test credits and things like that.

Corey:
And what did you build it in?

Brendan:
Typically, at Launchpad we leverage a client, an API structure. Very frequently we tend to write our APIs in Rails and our front ends in React. And on the back end, we also leverage Postgres as are primary relational data store and did some Redis work for caching and background queuing.

Corey:
Okay. So what happened?

Brendan:
The applications heaviest usage tends to come when students are taking this diagnostic test. So we're trying to simulate the real experience as close as possible because we're trying to use this as a predictor of what your SAT or ACT score might be. What we'll tend to see in the application is period of quiet and then periods of heavy activity and that comes when many students are taking the diagnostic exam. So we received indirect reports that the application was slow, students were struggling to complete their tests, things were timing out and the Proctor of the exam and the students taking it were just essentially panicking because they're in the middle of a test and couldn't complete it. And we saw this in the narratives that we heard from users, but in some of our analytics tools, such as Heroku Metrics and New Relics were also telling us a similar story in a more technical sense, but the application wasn't performing well.

Corey:
So this is really high stakes for the test taker, for the Proctor and now for you when you're notified of an issue and you're not sure what's going on. What are the first steps you took to try to figure it out?

Brendan:
So, we took that narrative feedback and then cross-referenced that with some of the New Relic alerts we were seeing and hearing people say things are unresponsive or slow, we're looking for, typically on the backend something that has a high response time or it's very, very memory intensive, those are the things that would align with that story. So, we backed our way into this, so to speak. We're able to use New Relic's transaction monitoring tool, and you can look at things on the web dyno, which would be what the user's experiencing, there's a background dyno, but that's an async process that a user might not really have an issue with if something's taking a while. So we looked at the web dyno and you can sort by most time consuming and highest throughput endpoints. There was one called the update test section endpoint that was being hit very, very frequently, more so than we would've thought. And it was not as performant or it wasn't having a quick response time.

Corey:
So you zoomed in on that?

Brendan:
Yep, so we were like, "Okay, this is makes sense. This is a endpoint that should be hit during the test taking experience, but it's being hit more than we would have thought." And by being hit under such a load it's becoming less and less stable and performant and ultimately having a user impact.

Corey:
Okay. So you know where the problem is, how did you zoom in more and see, okay, why is this in point being hit so much and why is it getting slower?

Brendan:
Yeah. So we had to think about the user experience and what the student is going through. So during the test taking experience, let's just take a math test section for example. A student is given, let's say 20 questions to complete, and they have 20 minutes to complete it. In the UI we want a countdown timer so the student is aware of their pacing and how much time they have left to complete those questions. So in addition to showing that real time in the UI with a countdown timer, we also want the ability for them to pause and potentially come back to resume that section later. And that required us to persist to the backend and ultimately, Postgres in our use case, how much time was left. If we're trying to update that every second that could be a bottleneck on the server to be persisting that all the time. So we chose, I think maybe every 10 seconds we would persist it to the backend. So there's an interval timer happening on the UI to update the countdown, but we're not necessarily making an API call.

Corey:
I see, okay.

Brendan:
When we reach that 10 second threshold, what would happen is we were doing a comparison that said, "Anytime, the time elapsed is greater than the threshold, we should make an API call." This works under the assumption that you make that API call and you get a pretty fast response time where it comes back from the server and resets that 10 second interval. The problem though, is that that endpoint gets hit very, very frequently by many users and was slow in responding. So what would happen is once it reached or exceeded the threshold, it would make a call, then it would a second later make another call because it hadn't received a response yet, which would then slow down the previous response. And it's a recursive nightmare, so to speak where there are these API calls that were happening and ultimately resulted in the user, when they went to save their answers or navigate away, they couldn't persist or it would seemingly lock up on them, they weren't able to save their results.

Corey:
So if I'm a student, what I don't realize is that my browser is slamming your API more than it should.

Brendan:
Mm-hmm (affirmative).

Corey:
What I do realize is that I can't save my exam …

Brendan:
Correct, yes.

Corey:
… and that's when I go to you and say, "What's going on with the app?"

Brendan:
Yep. That's when you get the, your app doesn't work coming from users.

Corey:
Okay. So you figured out you had a recursive API call issue on every web client accessing your app, or at least most of them, how did you fix it?

Brendan:
So I think there's a couple ways to tackle the problem. So there's the volume problem, and then there's the response time problem. So yeah, we could work to optimize the response time of that endpoint that was maybe taking two or three seconds or something like that, will require decent rearchitecting because it was a shared endpoint that we were using for other needs. So that was a viable option, but would be tedious and might cause this ripple effect of breaking something else even if you have automated tests. The other approach would be, okay, how do we just reduce the frequency of API calls? If this is truly a slower endpoint, why are we hitting it so much? It's a relatively simple change. So on our UI, just instead of once we hit that threshold having a greater than comparison, we just did an equal comparison.

Brendan:
So essentially once we hit the threshold, we trigger the one API call and then if we continue to count above, it's been 10 seconds, 11, 12, we're not going to make another API call because of the equality comparison. Now, when the API does eventually respond, it resets the time elapsed back to 0, so that then you're counting back up to 10, and then that would trigger another API call. All the while we're still having the countdown timers and being reset this is just an internal clocking mechanism to throttle, so to speak, the number of API calls we're making. So we got that update in place, which reduced the volume of requests and being able to horizontally scale some of our dynos, we were able to between those two things handle the load it was under.

Brendan:
And then that gave us some more time to actually optimize the endpoint and make it more performant, which in general is always like a better move. We knew that would take us a bit longer and they talked about earlier, this is a panic, stressful situation for the people administering the exam, for the students trying to take it, so time's of the essence of pun intended, I guess. And yeah, we wanted something to at least stabilize things and not have it be a UX impact even if performance wise under the hood, things were still rough.

Corey:
You got it fixed. When you're having a retro looking back at this, what are the biggest lessons you learned?

Brendan:
Yeah. So I think we had the foresight to have those monitoring tools installed. So New Relic as our application monitor some of the Heroku Metrics, having those enabled, so that was very helpful and I think it's something that in a retro we're like, "Okay, that should be a standard set up for us," because otherwise we would have been solely relying upon the feedback from users, which only tells part of the story because we weren't receiving 500 errors or exceptions or something obvious, we weren't getting notifications in other error handling mechanisms. So that was, I think, a positive that we had that installed. But I think in reflecting back upon the regression testing we did and testing before launch, basically individual students would be testing it.

Brendan:
So, when that endpoint isn't hit as frequently, you don't really feel it. And hearing how the application was used, in reality, there tends to be batches of students that are taking the exam at the same time. Not necessarily one-off, you might get an even cadence of students taking the exam, it was like, okay, 15 people in this room are going to take it, so you get heavy load that way. So, understanding that that was going to be a more common environment in which the app would be operating, we had to find a way to simulate that and build that simulation into our testing and QA validation process. So just to get things up and running, we did a manual group testing session. As we stumbled upon this we affectionately call them testeronis at Launchpad where we get in a room, I guess, virtual room now maybe and say, "Okay, well let's have five or six of us test the application."

Corey:
Yeah. This is a tough one because you don't see it as an individual developer in there walking through the flow, you need multiple users. So the testeroni makes sense as a tool there to catch that weird error that you will only see when you have a bunch of people on it at the same time.

Brendan:
Yeah. And testeroni was our affectionate name to recruit people that worked on the project to get them intrigued by coming in and testing this with us.

Corey:
Make it cute and they'll want to participate.

Brendan:
Yeah, if you've got the right branding they will be onboard. So in addition to that manual approach, we looked at leveraging more automated tools that could allow us to regression test this. There's a tool we tend to have pretty good luck with, it's called Flood.io and that gives you the ability to simulate a user, so a student in this case, clicking through the UI in a test taking experience, and you can run concurrent sessions just depending upon your needs. The tricky thing with that is just having seed data that is separate for multiple concurrent users going through this because the data will fundamentally be mutated, if you're running these tests in parallel, they have to be isolated. So you're going to use the same environment because that's how it's going to be in production.

Brendan:
But you need to do some extra work on that backend to prep the seed data. It's not just about writing the test script or running through the UI, you also have to have the backend set up with the right test data, so it's a valuable tool for sure. But it comes with a time investment for the development team to have that script that they can run to say, "Okay, set up the staging environment with 10 students that have fresh exams, not been touched at all, and then make sure that those test IDs or session IDs that will be used for each student are inserted into the test script, so that you're simulating that versus they're clicking through already completed tests and you're not getting the same experience."

Corey:
Wow. Lots of good takeaways. I think testeroni is the word of the day. Somehow it makes me hungry, I don't know why.

Brendan:
I think of pepperoni.

Corey:
Okay. Pepperoni, that makes sense.

Brendan:
I'll have to ask the product manager on our team how he cooked that up.

Corey:
Sounds good. Well Brendan, thank you so much for sharing your experiences and the lessons you learned and advice, really appreciate you being on Code[ish].

Brendan:
All right, thanks a lot Corey, take care.

Corey:
These stories helped me question my assumptions. Maybe automated testing isn't enough and we need to think about manually testing new features as a team. Maybe our dependencies aren't working quite the way we think they are. This is the value of storytelling. When others shared their experiences, we take away lessons for our own projects, thanks to Brendan and Meg for sharing their incident stories and thank you for listening.

About Code[ish]

A podcast brought to you by the developer advocate team at Heroku, exploring code, technology, tools, tips, and the life of the developer.

Subscribe to Code[ish]

Hosted By:

Corey Martin

Customer Solutions Architect, Heroku

with Guests:

Meg Viar

Senior Software Engineer, Nomadic Learning Company
@lmegviar

Brendan Hennessy

@bhenney