I Was There: Stories of Production Incidents II

Tools and Tips
January 19th, 2021
29:09

Also listen via

I Was There: Stories of Production Incidents II

Hosted by Corey Martin, Ifat Ribon, Christopher Ostrowski

Corey Martin leads the discussion with two developers about production incidents they were personally involved in. Their goal is to inform listeners on how they discovered these issues, how they resolved them, and what they learned along the way.

Ifat Ribon is a Senior Developer at LaunchPad Lab, a web and mobile application development agency headquartered in Chicago. For one of their clients, they developed an application to assist with the scheduling of janitorial services. It was built with a fairly simple Ruby on Rails backend, leveraging Sidekiq to process background jobs. As part of its feature set, the app would send text messages to let employees know their schedule for the week; these schedules were assembled by querying the database several times, fetching frequencies and availabilities of workers. Unfortunately, a client noticed a discrepancy between how many notices were being sent out, versus how many jobs they knew they had: of the 400 jobs total, only 150 had notifications. It turned out that all of the available database connections were being exhausted–but that was only half of the issue. Sidekiq was attempting to process far too many jobs at once, and each of these jobs were responsible for connecting to the database, exhausting the available pool. The solution Ifat settled on was to reduce the number of parallel jobs processed while increasing the number of connections to the database. From this experience, she also learned the importance of understanding how all these different systems interconnect.

Christopher Ostrowski is Chief Technology Officer at Dutchie an e-commerce platform for the cannabis industry. One Christmas Eve, while celebrating with his family, Chris began receiving pager notifications warning him about some sluggish API response times. Since it didn’t really have any significant end user impact, he ignored it and went back to the festivities. As the night went on, the warnings became significant alerts, and he pulled together a response team with colleagues to figure out what was going on. By all accounts, the website was functioning, but curiously, the rate of orders began to drop off. Through some investigation, they realized what was going on. Customers’ order numbers were assigned a random, non-sequential six digit numbers. Dutchie was about to track its one-millionth order, a huge milestone. Before any orders are created, though, the app generates a six digit number, and tries to create one that doesn’t already exist. The database was constantly being hit, as less and less six digit numbers were available for use. The solution ended up being rather simple: the order number limit was increased to nine digits. Although they had monitoring in place, the data was set up as an aggregate reporting; even though the “create order” API was slow, all of the others were low, keeping the average within tolerable levels. Christopher’s solution to avoid this in the future was to set up more groupings for “essential” API endpoints, to alert the team sooner for latency issues on core business functionality.

Links from this episode

LaunchPad Lab is a web and mobile application development agency
Dutchie is an e-commerce platform for the cannabis industry

Show Notes

Corey:
I'm Corey Martin. This is the second episode in the I Was There series. We're talking about production incidents, those scary times when your app doesn't work as you expect. Our guests are bringing these stories out into the open, so we can learn from them. We'll talk to two developers about production incidents they were personally involved in. First up is Ifat Ribon, a Senior Developer at LaunchPad Lab. Ifat, welcome to Code[ish].

Ifat:
Thank you. Thanks for having me.

Corey:
So, tell us a little bit about LaunchPad Lab.

Ifat:
Yeah. LaunchPad Lab is a web and mobile application development agency. We're headquartered in Chicago. We work with a variety of clients in the generally mid-sized business range, working from businesses that are just startups to also more mature corporations, looking to evolve their product offerings into a more digital space.

Corey:
What app are we talking about today?

Ifat:
So today we're going to talk about a problem I faced with an app that was developed for a client of ours that is a janitorial services company. They contract with large universities all around the United States. Their contracts require this client to be responsible for the hiring of the janitors and then dispatching them and scheduling sort of the tasks, sort of cleaning up the rubbish and mopping the floors and things like that to each of the universities they work with.
<!– more –>
Corey:
What is the architecture of the app and the stack?

Ifat:
Yeah. So, this application, which was built to sort of help them manage that schedule, part of their requirements was built with a Rails API backend and React for the frontend. They actually originally started with Rails both for back and frontend, and it was a migration we ended up doing a little bit later in the app so that we could take advantage of some of the slick interactions you can use with React.

Corey:
What story are you sharing with us today?

Ifat:
Yeah. So, this client, like I said, came to us to help them build an app that would help them build the schedules they were currently using Excel to put together. So, they've got, like I said, employees all over the country, working on these universities. Every day, these schedules might be a little bit different. Some tasks are done on a daily basis. Some are done on a weekly basis. Some are done on a more periodic basis either monthly or sort of more infrequently than that. They needed a way to put those schedules together in a more programmatic way and then also, they wanted to deliver those schedules to employees via SMS.

Ifat:
Essentially, employees would get a notification about an hour before their shift reminding them, "Hey, you've got these tasks today," and it would be tailored to exactly what they had to do in that day. They could do those tasks, and then about an hour after the shift, the employee would get another text saying, "Hey, how'd it go today?" They could respond with, "Yep, things went well," or, "No, I faced these issues." It was a nice way for the client to get a sort of a level of engagement with their employees even if they're not interacting with them in person day to day.

Ifat:
So, the issue was that in order to be able to send out those text messages that could let employees know exactly what their tasks were for the day, we have an overnight Rake tasks or like a cron job that runs to … In the database persist a snapshot of an employee schedule for that day based on the schedule that a manager might have created inside the app. Now, there's about, on a weekday, about 400 or more schedules that need to be created for a given day. It's a lot of schedules, and they're quite complex. There's a lot of queries in terms of fetching tasks by sort of frequency. I mentioned daily, weekly, monthly, and sort of putting those together. So, there's a lot of queries, a lot of records and a lot of touches to the database. It turns out though what came to happen was the task that originally was written wasn't completing correctly. These shifts actually weren't properly getting created for all 400 or so employees.

Corey:
How did you first notice that?

Ifat:
Yeah. So, a couple things happened at the same time. One, we do use an error alert or error monitoring tool called Sentry that notifies us of code level or application level errors. We were getting a set of notifications around database connections being full. So, simply, when we run this Rake task and we're hitting the database for each of these 400 schedules … At a given time, a database can be configured to only have certain number of pools open, and that's so you're not overloading your database, right? Default on Rails is five, so you can't necessarily ask for all those 400 at the same time.

Ifat:
Figuring that sort of more code level alert, we also unfortunately … You don't usually want this to happen, but the client also sort of tapped us and has said, "You put together this nice job report for us as one of the features, and this job report was meant to illustrate how many of these text messages were going out and how many of our employees are responding and how many jobs are getting completed, but these numbers don't seem to match up with how many jobs we know we have at our company on a given day. Can you look into that?"

Ifat:
So, from there, first, we check the report just to make sure all the numbers were at least telling out what we would expect. We came to find out that, coupled with the error alert about the database pool limits that in fact, not all 400 shifts were getting created. Only actually about … When I first looked into the issue, I saw it made it through maybe like 150 of the jobs.

Corey:
So, at this point, if I'm an employee, I might not be receiving a text that I need. If I'm the employer, I know something's wrong because it's showing up in the reporting that there's not the expected level of engagement with the text. If I'm you, I'm trying to figure out what's causing this. I see that we have some errors, but I'm still not sure exactly what the core issue is. Is that right?

Ifat:
That's right. The errors about the database connection were super helpful, of course, but like most errors that come from the code, it may not tell the whole story or even worse, it doesn't really tell you always how to fix it.

Corey:
Sure.

Ifat:
So, from there, the task was really to try to identify what was resulting in the error in terms of … Well, we were hitting our database max obviously, but there are a few different components at play. There's the database itself and the pool configurations I mentioned that you can tweak. There's also the architecture in which we were … First, the write test itself, which is going to iterate over 400 jobs. Is there something that can be manipulated there? Then, also sort of the middleman is the architecture for background processing. In this app, we use a tool called Sidekiq that helps us manage that. Essentially, you can schedule jobs to be through Sidekick, and it can also accept a configuration around how many threads it might spin up, a thread being essentially how many concurrent jobs can you run at the same time in the background.

Ifat:
So, we had to kind of tackle those three issues or those three factors, I should say. From there, it was honestly like a lot of trial and error, a lot of reading first about through some pretty DevOp, low level DevOps, configurations that maybe I hadn't had a ton of experience yet previously, so a lot of reading. There are a lot of helpful articles about there that help provide guidance on … You want to have this sort of magic ratio between your database pools and your background job concurrency limits. So, sort of tweaking those, kind of changing the number that I can give Sidekiq to say how many threads do you spin up versus the number I give Rails to tell the app how many database connections can we allow at the same time. There's a lot of changing that and unfortunately sort of deploying right away to production, right, because as many developers might experience, there's nothing like production data to test with, right?

Corey:
Absolutely. Yep. At this point, it's not what you would traditionally think of as a code issue, more of an infrastructure issue, and sort of tweaking the knobs on the concurrency settings to see what works under a production workload.

Ifat:
Yeah. It's more about the infrastructure and making sure that the different pieces fit together in the best way that they can.

Corey:
What did you ultimately arrive at?

Ifat:
Yeah. So, we ended up lowering the Sidekiq concurrency pretty low. It's set at two right now, which it could probably go higher, but we ended up making it pretty low. We actually bumped our database connections up to 10, which is twice as many as the default. The thing you have to remember about database connections is it's not just the number of database connections that one process like Sidekiq can use. It's also any other processes you may be running, so if we're running other jobs, and also the web process, right, when someone's actually visiting the app. One other thing we did just to help both those aspects manage the load is we added a simple sleep essentially as we're iterating over 400 jobs, putting a bit of a timer or a pause before it tries to tackle the next schedule to put together. That also helps give Sidekiq a little bit more time to finish its work and sensibly close those connections when they need to be closed.

Corey:
So, it sounds like you ended up throttling the background jobs in a sense because Sidekiq, there's not really a limit to how many jobs it can run at once in theory with enough instances, but your database, there is a limit there to how much load it can handle from those jobs, so it sounds like you had to find the right balance between running the jobs quickly, but not too quickly.

Ifat:
That's spot on. Nailed it.

Corey:
You figured out pretty fast what that magic number is, where you're in a spot where this is not happening anymore.

Ifat:
I would like to believe that it was quickly … In all honesty, it was over a bit of time but mostly because in order to prove out that the configuration changes were even working or doing what I expected, because this was on production, instead of trying to assimilate or run these tasks and then delete data that might be moot, I would push up the configuration changes, let that task run over the next night, and then I prepared a script for myself that I could take a look at the database, look at the records, and see if I was getting the numbers I expected. So, it's sort of trial and error over quite some time in which as I tweak each configuration, I could see hopefully that at least the number of schedules being completed was increasing until eventually, I got to the number of schedules created equal the number of schedules I expected to be created.

Corey:
What did you take away from this? What are the primary lessons that you would take to the next app that you work on?

Ifat:
A couple themes came out. One, from this specific experience, this idea of understanding how your different systems or components interact with each other. So, we mentioned I've got my background processor, I've got my database connection. There's also Redis, which is sort of queuing up those background processes. There's all sorts of underlying pieces that can work well together, but understanding when you pull this lever, how does it effect the other lever, and what kind of options are you left with until you find that sort of just right combination.

Ifat:
The other theme just more in terms of debugging and tackling a problem was, is this approach of … kind of attacking like a scientist, right, kind of coming up with a theory. Okay, here's what I think the issue is, a theory and then a hypothesis, right, and then identifying those variables that may be contributing to your hypothesis, knowing how to not let those variables conflate, right, so maybe tackling one variable at a time, things like that, and having a means by which you can prove out or audit your experiment against your hypothesis. I thought that was really helpful and what ended up leading me to be able to be confident that I, at the end, solved the issue.

Corey:
So, less grasping at straws and more of a methodical structured approach.

Ifat:
Exactly. It's easy sometimes to get a good idea or a hunch about what it is or what you may need to do to fix it, but I have found in my experience, sometimes, when I reach for that quick gut reaction, I may only be putting a Band-Aid on the problem rather than unlocking the core issue.

Corey:
Thank you so much for sharing this experience and being on Code[ish].

Ifat:
Yeah, absolutely.

Corey:
Now, another story of dealing with scale. Here's Christopher Ostrowski, Chief Technology Officer at Dutchie. Chris, thanks for being on Code[ish].

Christopher:
Yeah. Thanks so much for having me.

Corey:
To set the scene, will you tell us a little bit about Dutchie?

Christopher:
Absolutely. Yeah. So, Dutchie is an e-commerce platform for the cannabis industry. Essentially, we enable local cannabis retailers to sell their products online. So, we basically power thousands of dispensaries' online ordering menus all across North America.

Corey:
Wow. When did you start and how has your growth been?

Christopher:
So, Dutchie really got started towards the end of 2017. I've been with the company since May of 2018. I was the third employee. The growth has been pretty much astronomical. I don't think that's an overstatement. When I joined in May of 2018, we were doing about 40 orders a day, and since then, we've seen you really crazy growth to where we're now the industry leader, and we're doing tens of thousands of orders every day and just growing at an astronomical rate.

Corey:
Lots of volume there, which brings us to the reason why we're here. What happened on Christmas Eve 2019?

Christopher:
Yeah. I guess setting the stage, right? So, it's Christmas Eve, and I'm at a family Christmas party. It's getting later in the day, and as you can imagine, Dutchie was not working that day, and we didn't have the people that would normally be around, respond to incidents. So, I was pinged on my phone through one of our monitoring alerts that our database usage was starting to go up. At this point, it wasn't too unnerving. We had had some database issues that sometimes manifest as high load but really didn't have much of an end user impact. So, I opened a ticket with our database provider and essentially went back to the party because all of our other stats, as far as latency and request rate, all those were looking pretty normal. A little bit of a blip but also given that it was Christmas Eve, there was kind of a thought, at least in my mind, of traffic should be low anyways, so if our databases is a little bit sluggish, that should be tolerable for right now, and we'll look into it when we're back on the keyboard.

Christopher:
Kind of went back to the Christmas party. Of course, kept my phone near me. About a half hour later, our API monitors started to alert. So, at this point, we knew something serious was happening. So, at that point, it was run, grab the laptop, start calling some of the other engineers on the team, and start getting to the bottom of what's going on.

Corey:
What was the API alert that made you sort of raise the alarm on this a little bit more?

Christopher:
Yeah. So, our latency alerts were very quickly got pretty crazy. We have warning criteria and an alert criteria. We kind of shot right past the warning right into the alert for our core API latency. The p95 latency stat was well above what we always aim for. So, that was the one that definitely kind of woke us up and said, "Okay, there's something serious going on here."

Christopher:
For the most part, synthetic monitors were still working. We could load the Dutchie platform. The menus would load, and they're actually pretty snappy, so it was very odd, right? It wasn't a total outage yet. Nothing seemed to be too crazy at that moment in time. So, from there, we started to dig through our logging data, and we then switched over to looking at the database monitoring, which had gone even further beyond where it originally was in terms of utilization and CPU. One thing that was really, really odd was the queues were stacking up. So, what that meant is we had just a huge amount of queries that were being queued up and the database server couldn't get to in time. Again, something that was really bizarre, and we didn't fully understand why that was happening.

Christopher:
So, at this point, we're probably about a half hour into our, what we call our critsit situation. We have a number of engineers online. We're all on the group Zoom. We're trying to figure out what exactly what's going on. At this point, we noticed that our order rate was going down. So, one of our North Star metrics is average orders per minute. That had started to essentially collapse. We noticed that new orders were few and far between. Again, the timing is of course odd, right, because it's Christmas Eve, and it was later in the day at least on the East Coast, and so most dispensaries had been shut down at this point. So, it was a little bit tough for us to know, is this really a technical problem, or is it just the real usage that we have?

Christopher:
So, we got through that, and we started to get more and more team members involved. Eventually, we were able to hone in on basically one API endpoint, which was just the create order endpoint. Our monitoring probably wasn't as good as it should have been at this point, but we were able to eventually just realize that this one path, the latency was through the roof on it. Once we knew that the create order endpoint was kind of going haywire, we definitely just turned all of our attention to that.

Christopher:
So, the next step was understanding kind of line by line exactly what could be causing an issue. This is where, at least in our mind looking back, kind of the fun problem here. As we talked about earlier, Dutchie had kind of been on a crazy trajectory, and towards the end of Q3, we really realized that right at the end of the year, we were going to be on track for our one millionth order. Huge milestone for us, and we even set up some cool visualizations so that we could see the order count ticking up over time.

Corey:
That's fun.

Christopher:
Yeah. Yeah. It was really cool. It was a really important milestone for us, really being a proud moment for the company. That's where we kind of were, right? So, we were right at that one million mark. We weren't quite there just yet, but we were getting close. So, right about this time in our critsit situation, we had some other team members outside of engineering trying to place some different orders. Basically, the user feedback was it was a timeout after 30 seconds, so they would just get an error, but we eventually realized that some orders were actually able to go through and were actually able to go through almost instantly, so that was really interesting to us.

Christopher:
We eventually figured out that the crux of the issue was that we have a vanity ID, right? So, when you pull up an order in the Dutchie system, we don't want to show a long UUID, right, to our end users. That's obviously ugly and also really inefficient for communicating what order you want to be looking at. Way, way back in the day when Dutchie was first designed, a decision was made to make that a six-digit number. That's randomly generated, so it's just numeric, so just six digits. Where we started to realize the problem was, was for certain orders, like we can place a test order, and when you place a test order, we actually append a dash T to the end of that vanity ID. Those orders were going through just fine, and this is where we finally realized what our problem was. What we eventually found out was that one million orders is exactly the number of combinations you get when you have a six-digit number.

Corey:
Oh.

Christopher:
Yeah. So, 10 raised to the sixth, that's a million. We were pretty much right there. So, what our code was essentially doing was before it inserted the order into the database, it would run a routine that would generate a random number and then query the database to see if that already existed. Because we had exhausted the total space of IDs, that query was essentially never finishing because it would just have to keep trying and trying and trying. So, we essentially created a denial-of-service attack on our own database. We immediately realized the problem, kind of shook our heads at the simplicity of it, changed the length variable from six to nine, giving us a billion orders, which that would be a fantastic problem to have-

Corey:
One day.

Christopher:
… maybe in a couple of years. Yeah. So, now, we had a billion unique combinations, deployed our API code update, and instantly, utilization went way back down, and orders began flowing once again.

Corey:
One million was the problem and also a good thing and a bad thing at the same time. So, I have a couple questions. Your monitoring at first didn't make it extremely clear what was happening. You knew something was wrong from the database load. You knew something was up with the API endpoint for creating orders. Is there anything that could have made it obvious faster where the issue was?

Christopher:
Yeah, definitely. So, when we retroed this issue, and of course, looking back and saying, "Well, how do we prevent this, and what could have prevented it?" monitoring really was the number one thing that came out of it. It's a linear problem, right, where over time, you're going to run out of unique combinations, and that database query is going to just take longer and longer and longer. So, it's a slow burning problem that obviously crescendoed very quickly for us, but we realized that we really didn't have the granularity and monitoring to see that, "Oh, hey, this one individual endpoint has been creeping up in latency over time."

Christopher:
Had we had that in place and also a lot of monitoring tools that allow you to kind of designate critical endpoints, this would have been pretty obvious to us. We probably would've seen for upwards of a month before the latency just continuing to kind of tick up linearly. We definitely would've kind of thought to ourselves, "Hmm, that's pretty odd. I wonder why that latency's going up." Could have been probably avoided had we had better monitoring there.

Corey:
That's interesting, so as you've reached closer to one million, the orders got slower and slower slightly over time, and you could have spotted that, but just looking at your metrics, it wouldn't necessarily stick out.

Christopher:
Yeah, exactly, because in the grand scheme of your API, this create order endpoint is actually not hit very often in comparison with all the other APIs that you have, right, for fetching menu information, for loading the dispensaries' information. Those are all endpoints that are going to be hit potentially tens of times per session where the create order endpoint is going to be hit maybe once, right? So, if you're averaging across your entire API, your latency is going to be pretty good. One endpoint that is, in order of magnitude, less utilized is really not going to have that much of an impact on your overall latency numbers.

Corey:
Sure.

Christopher:
So, the lesson there is you really need to make sure that you identify these critical API endpoints even if they're not that hot in terms of load or utilization. So, that's definitely an immediate change that we made to add proper call-outs to say for our business needs, these are the top 10 endpoints that just always have to be working, right, and even setting essentially internal SLAs on the latency for those endpoints is really helpful there too.

Corey:
What tools or resources have you found most helpful for implementing that improved monitoring?

Christopher:
Yeah. So, one tool we really like is Datadog. Previously, we weren't really utilizing a Datadog or New Relic, and because Datadog was so plug-and-play for us, it was really easy for us to just drop that in into our whole environment throughout monitoring, of course, application level concerns but also jumping down to monitoring CPU usage, disk space, disk utilization. All of those, Datadog's, I think, really good at kind of surfacing and giving you the whole picture, which can be tough in today's cloud world especially as systems become more distributed.

Corey:
So, if you were to say sort one takeaway for our listeners from this experience, what would it be?

Christopher:
I would definitely say that you can't instrument too much. You can definitely have things like alert fatigue. You can have things like … Too much log data, it can be tough to see the forest through the trees, but as far as instrumenting things, it's very tough to end up in a state where you're overdoing it. It's also, these days, very easy to do, right? There's a million and one tools out there to ensure your proper monitoring. Of course, monitoring is the start, but then making sure that operationally and organizationally, things are set up right. We realized that we really hadn't put in the time to set up our critsit procedures. Who leads the team through problems? Who's the team lead? Who's the person communicating to support? We didn't have these roles defined. That's also a really worthwhile task and exercise to go through with your teams. Everybody's going to have downtime. You try to minimize it. You do the best you can, but as we all know, it's unavoidable, so proper planning and preparation for when you do have downtime is always going to yield the best results for your customers.

Corey:
That's great advice, and Chris, thank you so much for sharing this experience and your takeaways from it.

Christopher:
Absolutely. Yeah, I really appreciate it and glad I could contribute.

Corey:
These stories show the flip side of success. More scale can bring unexpected challenges. When an incident arises under heavy traffic, it can be very stressful. These stories reminded me that we can rely on monitoring tools and a methodical troubleshooting approach to investigate calmly. This is the value of storytelling. When others share their experiences, we take away lessons for our own projects. Thanks to Ifat and Chris for sharing their stories, and thank you for listening.

About Code[ish]

A podcast brought to you by the developer advocate team at Heroku, exploring code, technology, tools, tips, and the life of the developer.

Subscribe to Code[ish]

Hosted By:

Corey Martin

Customer Solutions Architect, Heroku

with Guests:

Ifat Ribon

Senior Developer, LaunchPad Lab

Chris Ostrowski

CTO, Dutchie
@TheKidCoder