115. Demystifying the User Experience with Performance Monitoring
Hosted by Greg Nokes, with guest Innocent Bindura.
How do you know an application is performing well beyond the absence of crash reports? Innocent Bindura, a senior developer at Raygun, shares the company's tools and utilities, discusses the importance of monitoring P99 latency, and talks about why developers should use data to drive decision-making.
In this episode of Code[ish], Greg Nokes, distinguished technical architect with Salesforce Heroku, talks with Innocent Bindura, a senior developer at Raygun, about performance monitoring.
Raygun provides tools and utilities for developers to improve software quality through crash reporting and browser and application performance monitoring.
According to Innocent, the absence of crash reports does not mean that software is performing well. Software can work - but not be optimal. Thus, Innocent takes a holistic view:
“I look at the size of my audience, and if it's something sizable, that gets a lot of traffic, for example, a shopping cart that gets a lot of traffic on a Black Friday. I would want to be in a comfort zone when I know that during the peak periods my application is still performing, so I tend to look at the end-user, how their experience looks like during very high peak periods. And from there I start working my way back to the technology that is supporting that application.”
Raygun really shines in monitoring the time spent in different functions and helping to improve the performance of highly hit endpoints. This covers performance telemetry from browser pages, the running application, and server-side application performance monitoring. Raygun has lightweight SDKs, or lightweight providers, that can be injected into code. These provide a catch-all for unhandled exceptions. They also encourage best practices, such as handling exceptions gracefully and reporting them manually.
Over time, Raygun can provide a complete picture of how the user session performed “from the point they visited your page, logged in, visited a couple of pages, and then left your application. The crash reports and the traces relating to that particular user are also tied up with that session on the Raygun side.” Innocent highlights a sampling strategy that reduces the noise of APM data.
Raygun also offers a bird's-eye view of the application with aggregated performance stats: “For the RUM product, you will have each page aggregated over time, regardless of how many users you've had in a period of time. You won't look at the individual sessions. That information is aggregated and you're able to see, for example, your median, your P90, and P99.”
Innocent focuses on the P99 figure because “whoever is in there has had a terrible time, and that forms the basis of my investigations. I want to know why there are so many sessions in that P99, and that P99 is probably a six or seven-second load time. I want to move that to a sub-three-second.” Innocent provides a definition of P99 for new customers undergoing the journey of performance optimization.
Next, Innocent asserts that decisions should be based on numbers and empirical evidence. He has found that actionable data enabled him to redesign an application using command query responsibility segregation, so that mission-critical work runs transactionally in real time while everything else is deferred to background workers.
Innocent concludes: “I think the life of a developer is an interesting one. We fit in everywhere situations permit, and we definitely take different routes to develop our careers. But ultimately what we should all be concerned about is the quality of the products that I produce. This definitely reflects on my capability as a software developer. What sets me apart from the next developer is not the number of cool techniques I can do with code, it's delivering a product that actually works and what better way of knowing what works when you actually measure things. Everybody should live by the philosophy of assuming nothing, measure everything. Everything and everything should be measured.”
Speaker 1: Hello, and welcome to Code[ish], an exploration of the lives of modern developers. Join us as we dive into topics like languages or frameworks, data and event driven architectures, and individual and team productivity, all tailored to developers and engineering leaders. This episode is part of our tools and tips series.
Greg Nokes: Welcome to Code[ish], this is Greg Nokes, distinguished technical architect with Salesforce Heroku. Today I'm talking with Innocent, a senior developer at Raygun. Innocent, could you tell me a little bit about what Raygun is, and what you do at Raygun?
Innocent: So Raygun is in the performance monitoring space. We provide tools and utilities that developers use to improve their software quality through crash reporting and browser performance, and application performance monitoring as well.
Greg Nokes: So how do you approach the performance monitoring and the application introspection?
Innocent: So when I look at an application that is performing suboptimally, firstly my philosophy is the absence of crash reports does not mean that software is performing really well. There are hidden things in software, there are tools that we use when we develop that don't work 100% very well as we would want them to. They do get the job done, but there are areas in there that can be improved. So I tend to take a holistic picture, I look at the size of my audience, and if it's something sizeable, that gets a lot of traffic, for example, a shopping cart that gets a lot of traffic on a Black Friday. I would want to be in a comfort zone when I know that during the peak periods my application is still performing, so I tend to look at the end user, how their experience looks like during very high peak periods. And from there I start working my way back to the technology that is supporting that application.
Innocent: So the first port of call is obviously, a lot of times for a user, the response codes that they are experiencing on their end. And from there, I try and determine whether it's an issue with their browser, or an issue with the software on the server side. Given that it's an issue with the software on the server side, I would then start looking into all the applications surrounding that application that supports the final product that the customer is seeing, and pick that apart, find what exceptions I'm experiencing. If there are no exceptions, then look into the queries that are running behind the scenes. If I'm using an object relational mapper, that's a favorite one for me to go to, I'll then look to see how often certain code paths are being executed, if we're experiencing any form of N+1 queries, and look into ways of restructuring the code and remodeling the data that we are presenting to the user so that it comes out of the database in a more consistent and timely fashion.
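As a rough illustration of the N+1 pattern mentioned above, the sketch below counts the queries issued when related rows are fetched one parent at a time, and compares that with a single JOIN. The `orders` and `items` tables, their contents, and both function names are made up for this example; a real ORM would generate similar queries behind the scenes.

```python
import sqlite3

# In-memory database with hypothetical tables, just for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE items (id INTEGER PRIMARY KEY, order_id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 'ana'), (2, 'ben'), (3, 'cy');
    INSERT INTO items VALUES (1, 1, 'book'), (2, 1, 'pen'), (3, 2, 'lamp'), (4, 3, 'mug');
""")


def items_n_plus_one(conn):
    """One query for the orders, then one more query per order (N+1)."""
    queries = 0
    result = {}
    orders = conn.execute("SELECT id FROM orders ORDER BY id").fetchall()
    queries += 1
    for (order_id,) in orders:
        rows = conn.execute(
            "SELECT name FROM items WHERE order_id = ? ORDER BY id", (order_id,)
        ).fetchall()
        queries += 1
        result[order_id] = [name for (name,) in rows]
    return result, queries


def items_single_query(conn):
    """The same data fetched with a single JOIN."""
    result = {}
    rows = conn.execute(
        "SELECT o.id, i.name FROM orders o "
        "LEFT JOIN items i ON i.order_id = o.id ORDER BY o.id, i.id"
    ).fetchall()
    for order_id, name in rows:
        result.setdefault(order_id, [])
        if name is not None:
            result[order_id].append(name)
    return result, 1


slow, slow_queries = items_n_plus_one(conn)
fast, fast_queries = items_single_query(conn)
print(slow_queries, fast_queries)
```

Both functions return the same mapping of order to item names; the difference is that the first one issues a query count that grows with the number of orders, which is the kind of thing that tends to surface only under peak traffic.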
Greg Nokes: And I assume that at Raygun you provide tools so you can do that introspection, so you can examine your code in a running environment, and then get good feedback on what it's actually doing?
Innocent: Yes, definitely we do. It's a well rounded approach, but one size doesn't fit all. From time to time I find myself having to put some arbitrary measurements inside my application that gets reported in a separate dashboard that Raygun doesn't provide. But overall Raygun does provide a one stop shop for most of the things that I would want to cite in my application performance.
Greg Nokes: Right. And one of the tactics I like to use is taking apart an individual request, like you said, looking at the time I'm spending in the different functions, the time I'm spending talking to my backing services, whether it be databases or APIs, or whatever, and see where I can tune a little bit more performance out of highly hit endpoints, or endpoints that are accessed a lot in my code. Does Raygun help me do that as well?
Innocent: I think this is the area where Raygun really shines, because it paints that picture straight from the user experience right down to the synthesized performance. So when you load a page in your browser and the telemetry of the performance is sent through to Raygun, as well as your application running, and application performance monitoring on the server side, and crash reports for it coming through, we can then associate all those telemetries together and show you that this user experience is associated with these crash reports, and it is associated with these stack traces on the server side, so you've got that holistic picture of customer experience on the browser relating to the server performance on the backend.
Greg Nokes: So I assume it's like a lightweight gem, or something. Being that I'm a Ruby person, something that I would just put into my Bundler file, and then go ahead and pass off like an API token to access your guys' service so I can start feeding that information directly in, and then start generating those reports?
Innocent: Yes, that is 100% correct. And seeing that you are a Ruby person, I have just been working in the APM team this past few weeks, and we did launch APM for Ruby.
Greg Nokes: Oh, cool.
Innocent: So the way it works is you have got one or two gems that you will reference, depending whether you're using Rails or Sidekiq. With that gem, you then provide an environment variable that holds your API key for the first run. Once we see that API key for the first run, it then gets persisted in a JSON file that we will read over and over and over again for as many times as you restart your application.
Greg Nokes: Now I see you also have some crash reporting. Are you using those same tools for crash reporting, are you introspecting the logs, or what are you using for that?
Innocent: For crash reporting we've got lightweight SDKs, or lightweight providers that you inject in your code. Generally the way it works, there's two approaches. We provide a catch-all, so all your unhandled exceptions that don't gracefully terminate a request, we can tap into those exceptions and report on them. But there's a better way because we try and encourage best practices for developers when they're working with software, and one of the best practices is you anticipate and handle all the exceptions in your application so that the user experience is not clunky, but you gracefully handle and try and recover where possible. But an exception is an exception, you do need to know about it when it happens. So we do offer a way of manually sending those exceptions as they occur and you catch them in your try/catch block. Not sure what the equivalent in Ruby here is, I'm a .NET person, so I'll give the examples from a .NET perspective.
Innocent: So when you catch your exceptions, effectively that exception has been handled and it might not bubble up all the way into the hook for the catch-all. So there you will have to implement some manual logging, the same way we would log those exceptions to a text file, and then have a look at it, or maybe log to a database, that's the same way you would just log those exceptions to Raygun. And doing so also comes with an added advantage: you can add tags and extra information maybe relating to the user who experienced that exception, and it offers you better troubleshooting options when you know who has been affected, when it happened, where it happened.
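The two approaches described here, a catch-all hook for unhandled exceptions and manual reporting from inside a handler, can be sketched roughly as follows. This is not Raygun's SDK: `report_exception`, the `reported` list standing in for a remote service, and the `checkout` function are all hypothetical names invented for the illustration.

```python
import sys

reported = []  # hypothetical stand-in for a remote crash-reporting service


def report_exception(exc, tags=None, user=None):
    """Hypothetical reporter: records the exception with optional context."""
    reported.append({
        "type": type(exc).__name__,
        "message": str(exc),
        "tags": list(tags or []),
        "user": user,
    })


def catch_all_hook(exc_type, exc, tb):
    """Catch-all: the interpreter calls this for exceptions nobody handled."""
    report_exception(exc, tags=["unhandled"])
    sys.__excepthook__(exc_type, exc, tb)  # keep the default traceback output


sys.excepthook = catch_all_hook


def checkout(cart, user):
    """Handled path: recover gracefully, but still report what happened."""
    try:
        return sum(item["price"] for item in cart)
    except KeyError as exc:
        # A handled exception never reaches the catch-all hook,
        # so it is reported manually, with tags and the affected user.
        report_exception(exc, tags=["checkout", "handled"], user=user)
        return 0


total = checkout([{"price": 5}, {"name": "no price"}], user="user-42")
print(total, reported)
```

The handled `KeyError` is recorded with its tags and user even though the request recovered, which is the extra troubleshooting context the manual route buys over the catch-all.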
Greg Nokes: So being able to tie together the browser experience and the code introspection on the server, that seems pretty powerful. Do you have a method of tracking a user's journey through the application so I could, maybe with anonymized data, look at one user as they transit through the application and see all the endpoints that they hit, and see what their individual experience is like?
Greg Nokes: That's really cool.
Innocent: Yes, and over time we're also able to give you a complete picture of how the user session performed from the point they visited your page, logged in, visited a couple of pages, and then left your application, and the crash reports and the traces relating to that particular user are also tied up with that session on the Raygun side. An important point, though, is with APM. We don't track all the traces for every user depending on the sampling ratios that you choose, because APM tends to send a lot of data, a lot of it which might be just fluff you might not be interested in, so we've got a sampling strategy that reduces a lot of that noise and gives you some interesting information when the interesting information is available. So not all users might have traces, but if you set up your tracing for one-to-one, we will have all that information.
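To make the idea of a sampling ratio concrete, here is a minimal sketch of one common approach: hash each trace ID into a number between 0 and 1 and keep the trace if it falls below the chosen ratio. The transcript does not describe how Raygun's sampling actually works, so the function name, the hashing scheme, and the trace IDs below are assumptions for illustration only.

```python
import hashlib


def should_keep_trace(trace_id: str, sample_ratio: float) -> bool:
    """Keep roughly `sample_ratio` of traces, deterministically per trace ID."""
    if sample_ratio >= 1.0:
        return True   # one-to-one: every trace is kept
    if sample_ratio <= 0.0:
        return False
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < sample_ratio


# Made-up trace IDs; with a ratio of 0.1 roughly one in ten is kept.
trace_ids = [f"trace-{n}" for n in range(10_000)]
kept = [t for t in trace_ids if should_keep_trace(t, 0.1)]
print(len(kept))
```

Hashing rather than drawing a random number means the same trace ID always gets the same decision, which keeps a trace either fully in or fully out across processes.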
Greg Nokes: So how do you combine, I understand having that user token that you can follow them around with, how do you combine multiple users experience and multiple application functions experience into a holistic view of the overall application performance, almost like a generic score, or something like that. Do you have a concept like that?
Innocent: Yes indeed, we do. So we have been speaking of the user specifics, which is a result of a drill down. When we actually go a number of levels higher and get a bird's-eye view on the application, you will get your aggregated stats on the application performance. Say for the RUM product, you will have each page aggregated over time, regardless of how many users you've had in a period of time, you won't look at the individual sessions. That information is aggregated and you're able to see, for example, your median, your P90 and P99, which is what interests me about the RUM product because I tend to focus on the P99 figure, whoever is in there has had a terrible time, and that forms the basis of my investigations. I want to know why there are so many sessions in that P99, and that P99 is probably a six or seven second load time. I want to move that to a sub three-second.
Innocent: So whoever is sitting in that P99 bucket is of interest to me, and I'm able to drill down further into their specific sessions to find out what was going on. More often than not you'll find probably we have a data center in the United States and this customer is sitting somewhere in South Africa, for example, and their load time is affected by latency. There's really nothing we can do for that kind of user. Or, perhaps there is. AWS now has data centers in South Africa as well, so it might mean that we need to route the traffic to a data center that's closer to them to get rid of that latency, or our assets are loading a lot slower, we might have a cache site closer to them. So we do have that ability of taking a bird's-eye view on everything and decide on the specific areas that we really want to drill into.
Greg Nokes: Yeah, I wish we had a way of breaking the speed of light, I always joke around with folks when we're talking about latency between data centers, and stuff, that we haven't been able to figure out how to break the speed of light yet, but we are working on it, so keep tuned, maybe we will one of these days. And I totally agree with you about the P99. Over the last 12 years that has really become that holy grail to me is the more I can push that P99 number down and get it as low as possible, the better experience for all of my customers on any website that I'm working for.
Greg Nokes: So I think that not a lot of people think about the P99, because when I talked to new customers who are just undertaking this journey of performance optimization, a lot of times I have to educate them on exactly what a P99 is. So maybe we could take a few seconds and you could tell me your understanding of what a P99 is so any folks that are listening who aren't aware of it will come away with a knowledge of what that is and why it's so important.
Innocent: All right. So my understanding of the P99 number, of how we use it and how we display it to our customers, is an aggregate value that a subset of your customers fall into. If you were to draw a normal distribution curve, it would appear like a bell shape, and there is a long tail that will stretch towards the right, almost like in a [inaudible] boundary. And right at the tip of that long tail you've got a bucket with an arbitrary number. If you've got millions and millions of customers on your website, or an application, that number might be sitting in the hundreds of thousands. I always tie this with behaviors. I'm a millennial myself, and the younger generations after millennials have got far less patience than I do, and I've got far less patience than the generation before me, the baby boomers.
Innocent: So I want to maximize profits, and I know I'm dealing with people who don't have patience for slow loading sites. I'll give you an example, if I'm shopping online and the application I'm shopping with is not performing, I'll simply shut it down and move on to the next, and if that's not also performing, I'll shut it down and move on to the next. So I'm interested in keeping these people that fall into that bucket of slow loading times. So for me, the P99 represents the number of people having the absolute worst experience with an application. That is why I take particular interest in it, and find out those reasons that are affecting that small number of people. If I can solve their problem, I have probably improved the life of 98% before them.
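For listeners who want the numbers behind this, a P99 is usually computed as the value that 99% of samples fall at or below, so the slowest 1% sit at or beyond it. The sketch below uses the nearest-rank method on made-up page-load times; the `percentile` helper and the sample data are assumptions for illustration, and monitoring tools may use slightly different interpolation rules.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# 1000 made-up page-load times in seconds: mostly fast, with a slow tail.
load_times = [1.0] * 890 + [2.5] * 95 + [6.5] * 15

median = percentile(load_times, 50)
p90 = percentile(load_times, 90)
p99 = percentile(load_times, 99)

# The sessions worth drilling into: everyone at or beyond the P99 value.
slowest = [t for t in load_times if t >= p99]
print(median, p90, p99, len(slowest))
```

With this data the median looks healthy at one second while the P99 sits at six and a half seconds, which is why an average or median alone can hide the users having the worst time.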
Greg Nokes: Exactly, that's exactly how I think about it, that's a fantastic explanation. So we've got all of this data coming in from the browser, from the application, we've got it bubbled up into dashboards and overarching metrics. What do you suggest people use all of this, besides pushing down that P99 number, what other uses for all of this information that you're gathering about an application's performance, what do you think folks should be using that for?
Innocent: Right, so from experience I have learnt that decisions should be made based on numbers. I think the better part of my career, I have thumb-sucked a lot of numbers without real empirical evidence of why we are deciding on a certain thing. And if I can give you an instance where these numbers actually did help me in making up a lot of decisions. One of my previous roles, I was a team lead, and we had an application that had been problematic in that company for over two years, and the problems were, we were trying to do everything in one go. And then hooking up some monitoring tools and augmenting that user experience, we then realized that we were doing too much in one call, and there were certain transactions that could obviously be deferred and processed on the side without the user waiting for the feedback because most of the time they wouldn't be interested in that feedback anyways almost immediately, it will be something that they would need to look at maybe a month later as an aggregated report, or something.
Innocent: Where I'm going with this is looking at that data enabled me to redesign the application, and I took the command query responsibility segregation pattern in, where stuff that is not mission critical is deferred for processing by a number of workers in the background, and the stuff that is mission critical is executed in a transactional manner in real time. So when you actually look at the performance of your application, you've got to determine what the happy path is, what the mission critical path is, and decide on deferring processing for later in the background. Not everything needs to be transactional and not everything needs to be real time.
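The deferral idea described here can be sketched with a queue and a background worker: the mission-critical step runs inline while the user waits, and the reporting work is handed off. This is only the deferral part, not a full command query responsibility segregation design, and `place_order`, `ledger`, and the report rows are hypothetical names; a production system would more likely use a durable job queue than an in-process thread.

```python
import queue
import threading

deferred = queue.Queue()  # work that can wait
report_rows = []          # written only by the background worker


def worker():
    """Drain deferred jobs until a None sentinel arrives."""
    while True:
        job = deferred.get()
        if job is None:
            deferred.task_done()
            break
        report_rows.append(("monthly_report", job["order_id"], job["total"]))
        deferred.task_done()


def place_order(order_id, total, ledger):
    """Mission-critical step runs inline; reporting is queued for later."""
    ledger[order_id] = total                              # the user waits for this
    deferred.put({"order_id": order_id, "total": total})  # the user does not wait for this
    return "confirmed"


ledger = {}
thread = threading.Thread(target=worker, daemon=True)
thread.start()

statuses = [place_order(i, 10 * i, ledger) for i in range(1, 4)]

deferred.put(None)  # tell the worker to stop once the queue is drained
deferred.join()
thread.join()
print(statuses, report_rows)
```

Each call returns as soon as the ledger entry is written, so the response time the user sees no longer includes the reporting work.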
Innocent: So we are collecting all these metrics, reporting on them, and giving you data that you would need to look at to make an informed decision. If you were to ask me why the application was originally designed in that manner of everything is transactional, that is because somebody thumb sucked what the best practices were, and thumb sucked that the loads would be solely to end, things would be performant. But over time that didn't prove to hold up, and it called for a complete redesign. So you might be a company out there experiencing one of these constant problems with your applications where you can't decide what to keep and what to hold off. Having this kind of data enables you to see what that mission critical path is, what that happy path should be looking like for your customers, and make that informed decision based on actual actionable data.
Greg Nokes: So do you have any closing comments?
Innocent: I think the life of a developer is an interesting one. We fit in everywhere where the situations permit, and we definitely take different routes to develop our careers. But ultimately what we should all be concerned about is the quality of the products that I produce definitely reflect on my capability as a software developer. What sets me apart from the next developer is not the amount of cool techniques I can do with code, it's delivering a product that actually works, and what better way of knowing what works when you actually measure things? Everybody should live by the philosophy of assume nothing, measure everything. Everything and everything should be measured.
Greg Nokes: That's fantastic advice. It's been phenomenal talking with you, it was educational for me, and I learned a lot about Raygun and what you guys do, and how you think about the broad space of performance introspection. It was just great to talk to you, Innocent. Looking forward to doing this again, and with that, thanks again for being on Code[ish].
Innocent: Yeah, thank you for having me. I love this kind of things where we talk about technology and best practices, and kudos to Heroku and Salesforce for having this sort of thing out there.
Greg Nokes: Absolutely. We're always happy to give back and to talk to folks that have been in the industry. I had an old boss who used to like to say that he didn't trust anybody in this industry that didn't have a few scars. So to be able to show off our scars and the stories behind them, and hopefully allow other people to get different scars than we've gotten.
Innocent: Yeah, that's definitely the way we should be doing things. My experience should not be the next person's experience.
Speaker 1: Thanks for joining us for this episode of the Code[ish] podcast. Code[ish] is produced by Heroku, the easiest way to deploy, manage, and scale your applications in the cloud. If you'd like to learn more about Code[ish], or any of Heroku's podcasts, please visit heroku.com/podcasts.
A podcast brought to you by the developer advocate team at Heroku, exploring code, technology, tools, tips, and the life of the developer.
Master Technical Architect, Heroku
Greg is a lifelong technologist, learner and geek. He has worked at Heroku for over 8 years.
Senior Software Engineer, Raygun
Innocent has 10+ years in software development. He's often found in the trenches sluicing every new technology fad to find out what works best.