95. Intelligence Through Logging

Tools and Tips
November 3rd, 2020
Episode 95
29:05

Hosted by Corey Martin, Ariel Assaraf

Corey Martin, a customer solutions architect at Heroku, interviews Ariel Assaraf, the CEO of Coralogix, a platform that helps companies get a grasp on their log data. All too often, logs are treated as nothing more than a debugging tool. After receiving an alert about high resource usage or an elevated error rate, a developer might check their logs to see what caused the issue. But Ariel argues that this is too late to investigate a problem; by visualizing and alerting on log data, you can catch production problems before users encounter them.

Metrics, in other words, are a lagging indicator, while logs are a real-time representation of how your code is really performing. One way to reconcile the two is to aggregate log data and funnel it into long-term metric storage, which lets you see longer-term trends. Ariel also describes a scenario where log records appear in groups, such as a user purchasing a product, followed by an API call to Stripe, and concluding with an email notifying the user. A platform like Coralogix can automatically identify that the three logs arrive together within a certain time frame. If, for any reason, one of these steps fails to log, an alert can prompt the team to investigate proactively, rather than waiting for a customer to write in and report an error.

For an organization to begin using logs as time-series data, Ariel recommends three things. First, a unified log format, which could be something structured like JSON; a middleware service can standardize it. Next, a shared understanding across teams of the severity with which to log a message. The final step is to set up an alerting policy: not only which types of alerts to create, but also where they go, such as Slack, email, or text message. After that, you can begin to incorporate your logs into your monitoring processes.

Links from this episode

Coralogix is an observability platform for logs, metrics, and security

Show Notes

Corey:
Hi. I'm Corey Martin, a customer solutions architect at Heroku. Today, we're talking about using logs in new ways. I always thought of logs primarily as a place to look when things were going wrong in my apps, but our guest today explains that logs can be the best source for live business intelligence. Ariel Assaraf is the CEO of Coralogix, a logging platform that helps over a thousand companies get a grasp on their log data and proactively address their technical and business problems. Ariel, welcome to Code[ish].

Ariel:
Thank you, Corey.

Corey:
First of all, tell us a little bit about Coralogix and why you started it.

Ariel:
Yeah, so it's an interesting story actually. Coralogix started about five years ago. We're a log analytics company. Basically, machine-learning-powered log analytics that is now evolving into more areas, like infrastructure monitoring and security, and soon, even tracing. When we started the company, it came from a pain that we had. So I worked for a multinational corporate with a lot of big enterprise clients, hundreds of them, and we just had to dig into logs so many times to figure out production issues.

Ariel:
Having big enterprise on-prem customers means that you can't just log in and have a look at the data. You need to actually go all the way to the customer, export the data. Sometimes you need to mask a lot of the data because it contains information that you shouldn't see, and then manually browsing through gigabytes and terabytes of logs to figure out what went wrong in production. It was so painful, and even the tools that we used to monitor and analyze were not sufficient. So me and a couple of friends just decided to stop everything, leave and start our own company that will solve the problem of, let's call it, lack of productivity in log management, log analytics.
Corey:
So you mentioned the case where a customer says something is wrong, and then you look at the logs to see what that might be. I imagine that can be a stressful situation. You're reacting. You want to figure out as quickly as possible what's happening. Based on your experience in logging, what can a team do to make it a little bit less scary when they go in and look at their logs in a really pressured situation? How can they make it easier for themselves to figure out what's happening?

Ariel:
One of the biggest challenges of a company in our space is to keep our customers happy because when they arrive at the system, many times, they're already under so much stress and so frustrated that any tiny delay, any tiny feature that is not 100% ironed out is a source of a lot of frustration and anger. So you need to really, really be great in order for your customers to like your product and find it delightful. But to your question, I think that it's a lot about the state of mind that people have. So a lot of times, people would look at log data as a debugging tool or what's called a post-mortem tool. So you get an alert. You get an alert from your customer, or you get an alert from one of your other systems, a metrics system most of the time, infrastructure monitoring, and then you see, "Okay. I have latency, high CPU, high memory, customers complaining. Whatever. Now, let's dig into logs and see exactly what happened there."

Ariel:
This is where the problem actually starts. When you want to figure out production problems before your customers do, or before performance deteriorates, or before you have latency, your logs are actually the best source of data in order to do so. So if, as a customer, you want your logging experience to be less stressful, the best thing for you to do is actually use your logs before there's a crisis. Visualize your logs. Alert on your logs. Learn the data or have that data fed into a channel where you spend your time. That would be like Slack, or Microsoft Teams, or whatever, and then be ahead of the curve. When you enter your log data, knowing that, "I need to solve those or else I might have a problem in a few hours," you're much less stressed.

Corey:
What are some examples you've seen of a customer who has a great logging set up, and they're likely to see an issue before a customer sees it?

Ariel:
So we have a lot of customers that we look at as good examples of proper logging, proper monitoring. To mention a few, and then we can drill down into a couple: I think Masterclass are doing a great job in their logging and monitoring. Capgemini are great. Monday.com are great. Postman really inspired us. We can dig into the Capgemini story, for instance. What they've done with logs is actually incredible. So they take in logs from multiple systems as they are rolling out a very big change for one of their biggest clients, which is a large car manufacturer. They basically visualized their logs, both as textual visualizations to understand exactly what type of errors they have and what type of broken flows they have using our flow anomaly, creating a lot of alerts to understand when something goes wrong and be able to tackle that problem before it becomes a crisis, and also translating a lot of their logs into metric data.

Ariel:
So unlike metrics, which are a lagging indicator, logs are the first encounter of your code with reality. Whenever you run your application, whenever someone uses your application, as your code runs, it prints your logs. Metric data, infrastructure monitoring, is basically a result of that code running and using resources on your infrastructure, which are reflected in your metrics. But if you are connected to your log data and you transform logs to metrics, which is something that you can do with our tool, for instance, basically, define aggregations or other metric measurements that you want to extract from your logs and keep them long term, you're able to see trends that sometimes evolve over months of, let's say, a specific error that affects your performance.

Ariel:
Then, whenever there's a spike of usage, that error happens multiple times, greatly affects your performance, and then that is reflected in your customer's experience. That is something that you'd see with metrics or tracing only after the latency had happened. But if you're tracking your logs properly, you'd be able to see those small, tiny spikes that won't affect anyone, but you see them, and you know exactly which error to solve, and then you can do that before anyone ever experiences problems in your application.

Ariel:
So Capgemini has done great work, both in analyzing the anomalies that we generate and beyond. They're using Coralogix with Heroku. So they have the Heroku tags, the Heroku pipelines integrated. So they automatically benchmark any change that they make to production. They've connected Grafana to Coralogix to take the metric aggregations that we create, present them, and alert on them. What they've done is they've exposed some of these dashboards to their own customers. So actually, their customers are able to track performance metrics and also business metrics from Coralogix's integration with Grafana, and their customers have full transparency on how the project is going without even having to ask Capgemini to report to them.

Ariel:
So that brings, first of all, better stability. Second of all, better tracking business KPIs, such as inserts to tables, new users, onboarding, purchases, and so on and so forth. But most importantly, that creates a relationship of trust between Capgemini and its very important customer that has full transparency over what's happening in his production, and he knows that his technical consultants at Capgemini are delivering exactly what he wanted at the performance that he wanted. So that's the use case that really inspired me because it goes way beyond log monitoring even. So we spoke about going from reactive to proactive, but this goes way further into business and into customer relationships and transparency.

Corey:
So much that I want to unpack there. First of all, the idea of logs as metrics. Traditionally, we think about logs as free-flowing masses of text. Maybe you run a search on it, but it's, at the end of the day, a bunch of log lines. You're talking about logs as something that can power more structured metrics, whether they be graphs, or KPIs, or something like that. For someone who's never used logs in that structured of a way, could you explain how that works?

Ariel:
Basically, log data is time series data. It's generated from your application. It has a lot of text. So if you think about that text as labels to that time series data, you can use logs in a much smarter way. Then, you ask yourself, "So how come not everyone is using logs as metrics with unlimited labels of data?" The reason is that logs are very expensive. They're very chatty. There's a lot of data that is present in any log record, and then it's a lot more efficient to just send the metrics directly. So what we've created for logs is… One of the features is called Logs2Metrics, where you basically define a query, an aggregation, and then you're able to group this by fields in your log to add labels, and then store that metric, let's say, every five minutes. So let's imagine something like: whenever the log text contains "user purchased" and contains that the purchase size is worth over a thousand dollars, save me that metric and group this by the country, the city, and the customer age.
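
As a rough illustration of the rule Ariel describes here, the sketch below filters purchase logs over a thousand dollars and counts them per five-minute bucket, grouped by country, city, and age. It is plain Python, not Coralogix's Logs2Metrics feature; the record fields (`message`, `amount`, `country`, `city`, `age`), the helper names, and the sample data are all assumptions made for the example, and it presumes the logs have already been parsed into structured records.

```python
from collections import Counter
from datetime import datetime, timezone

BUCKET_SECONDS = 300  # store one data point every five minutes


def big_purchase_counts(records, min_amount=1000):
    """Count purchases over min_amount per (bucket, country, city, age)."""
    counts = Counter()
    for rec in records:
        # The "query": the text mentions a purchase and the amount is large.
        if "user purchased" not in rec["message"].lower():
            continue
        if rec["amount"] <= min_amount:
            continue
        # Round the timestamp down to the start of its five-minute bucket.
        ts = int(rec["timestamp"].timestamp())
        bucket = ts - ts % BUCKET_SECONDS
        # The "group by": every remaining field becomes a metric label.
        counts[(bucket, rec["country"], rec["city"], rec["age"])] += 1
    return counts


def record(minute, message, amount, country, city, age):
    """Build a hypothetical parsed log record (sample data only)."""
    return {
        "timestamp": datetime(2020, 11, 3, 12, minute, tzinfo=timezone.utc),
        "message": message, "amount": amount,
        "country": country, "city": city, "age": age,
    }


records = [
    record(1, "User purchased product", 1200, "US", "New York", 16),
    record(3, "User purchased product", 1500, "US", "New York", 16),
    record(4, "User purchased product", 40, "US", "New York", 16),  # too small
    record(4, "User logged in", 0, "US", "New York", 16),  # not a purchase
    record(7, "User purchased product", 2000, "UK", "London", 34),
]

metrics = big_purchase_counts(records)
for key, count in sorted(metrics.items()):
    print(key, count)
```

Only the aggregated counts need to be kept long term in a setup like this, which is why a metric is so much cheaper to retain than the raw log lines it was derived from.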

Corey:
Wow.

Ariel:
So now, what Coralogix is going to generate for me is the amount of big purchases that users have made on my website per country, city, and user age. So we've just taken a log record that was printed, like the boring, classic log record. "User purchased product. Price is a thousand dollars in New York. User age is 16." This is like a textual record that we see in any log, but once you parse it and turn it into a meaningful metric, suddenly, you have this powerful business information that is coming from your logs way before it goes to any BI system. But more than that, when you tie it to operational data that is also present in a log, think about saying something that no business person can tell. "Hey, we have fewer purchases in the past hour." That graph basically converged with the graph of error trends.

Ariel:
So we see more errors, more latency, and together with that, we see a decline in large purchases. This is extremely powerful, and only log data contains all that information in a single place. So what Capgemini has done is extract those unique, important business metrics together with the operational and performance data, together with the raw logs, and give all of that as a dashboard to their customer so he can basically see how the improvements that Capgemini are making for him are affecting his business KPIs.

Corey:
I've always thought of logs as the most live, recent source of data, but the data that I've been thinking about is errors or really technical stuff. Not business data, but it makes sense. The business data is there. It's a matter of using it. Were you thinking of that when you started Coralogix? Is that something that you've more recently realized the use of?

Ariel:
Yeah. It's pretty cool how most of the time your customers are smarter than you in figuring out the right use cases. So we just started seeing that about a couple of years ago with big clients of ours, like BioCatch, for instance, that are doing fraud detection for banks. Suddenly, they were using Coralogix mainly as a very powerful tool to create reports for their clients, again saying, "Here are the people that visited your site from these geographies. These were the threats, these were the actions that they've taken." All of that was just coming in the log data. They could have extracted it in multiple ways, used ETL to enrich it, parse it, pour it into a database, but it's already there in the log data. It's a better source of truth because it doesn't depend on how you've implemented your collection or BI. It's just generated with your code, and it's rich with information that combines that business data together with operational data and performance data.

Ariel:
So basically, we never thought of that. I would be happy to say that yeah, we thought of everything, but no. But once we'd seen that, we started thinking of tools like the Logs2Metrics converter, and about the fact that logs are great because they're so verbose, but logs are super expensive because they're verbose. So we started questioning our customers and asked, "Guys, you're using the anomaly detection features, and the alerting features, and the dashboards, and that's all clear to us. But now, you started using Coralogix for BI. What are the pains?" They said, "It's super rich in data, which is great, but it's unscalable." So we asked, "What is it within the log that you use, or what exactly do you need in that use case?"

Ariel:
In order to meet that, we re-engineered Elasticsearch, which is the index in our backend. We've rewritten big parts of its parsers to allow a lot of the actions that we are performing on the data to extract the metrics, to discover anomalies, to build aggregations, to stream live data into our clients, either CLI or web interface. Now, we're able to do them in real time without storing the data, and that means we're able to provide our customers everything that they need for their monitoring and BI use cases at about a third of the cost. So we're eliminating the things that prevented logs from getting their place in the past five, six years.

Corey:
You're only storing what the customer really needs at the end of the day, which are these metrics that they're actually using. Not every single log line in all of history?

Ariel:
Exactly. Only what's important. So we let them know about the statistics and the metrics, and we let them know if there's an anomaly, if there was a spike of errors, or if we discovered an alert that they defined. They can see the live data. They can obviously choose, for a certain application or a certain service, to store it on hot storage. That's okay. But whenever it doesn't meet that use case of very frequent searches of the actual raw data, we only store what's important. So they can basically get the signal without paying for the noise.

Corey:
So we've talked about a few use cases here. The traditional going in and looking at your logs in the middle of a crisis, the proactive alerting based on metrics that you've established from your logs. You've mentioned anomalies a few times, and I want to dig in to that a bit more. What kind of anomaly detections does your platform do? What does it come with out of the box? What do you set up, and how do you see customers using anomaly detection?

Ariel:
So when we started the company, like I mentioned, the three main things that we wanted to answer were what, where, and when. So we started thinking about the biggest problems that people face today with logs. First of all, there's just too many of them. Second, it's very hard to understand correlation between events, and third, there are so many microservices and components that it's impossible to define thresholds for all of them. So we've created this way of gradually analyzing your data that, out of the box, without you having to configure anything, gives you all the insights that you need.

Ariel:
So first of all, the first step is us clustering the data. So think about a log that says, "User Corey logged in from New York," and then the second log will be, "User Ariel logged in from San Francisco," and then the third log will be, "User John logged in from Austin, Texas." So these are basically three identical logs with different variables. Coralogix automatically understands that instead of showing you three text records, I can just say, "User logged in from a certain city and a certain state." Then, if you click the variables, you will see the distribution.
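
To make the clustering step concrete, here is a deliberately naive sketch that collapses Ariel's three login messages into one template by replacing the tokens that differ with a `<*>` placeholder. This is not Coralogix's algorithm, which the episode does not describe; the whitespace tokenization, the 0.5 similarity threshold, and the function names are assumptions, and the approach only merges messages with the same number of tokens.

```python
def similarity(template, tokens):
    """Fraction of positions where the template and the message agree."""
    same = sum(1 for a, b in zip(template, tokens) if a == b)
    return same / len(tokens)


def merge(template, tokens):
    """Keep tokens that match; mark positions that differ as variables."""
    return [a if a == b else "<*>" for a, b in zip(template, tokens)]


def cluster(messages, threshold=0.5):
    """Group messages into templates and count how many fall into each."""
    templates = []  # each entry: {"tokens": [...], "count": int}
    for message in messages:
        tokens = message.split()
        for entry in templates:
            candidate = entry["tokens"]
            if (len(candidate) == len(tokens)
                    and similarity(candidate, tokens) >= threshold):
                entry["tokens"] = merge(candidate, tokens)
                entry["count"] += 1
                break
        else:
            # No existing template was close enough: start a new one.
            templates.append({"tokens": tokens, "count": 1})
    return [(" ".join(e["tokens"]), e["count"]) for e in templates]


messages = [
    "User Corey logged in from New York",
    "User Ariel logged in from San Francisco",
    "User John logged in from Austin, Texas",
    "Payment failed for order 1234",
]

templates = cluster(messages)
for text, count in templates:
    print(count, text)
```

The three login lines end up under a single template with a count of three, while the unrelated payment message stays on its own. A production system would also need to handle variables of differing lengths, such as a one-word city name.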

Ariel:
So if you think about it, if you do that to all the data in real time, instead of looking at an infinite amount of logs, an infinite amount of text, you only look at the unique records, or if you want to call it that way, you go back to the amount of logs that you have inserted in your code. So that's the first step. Now, we have templates of data and not just unstructured text. Now, we start learning the behavior of these templates and understand exactly when that template should arrive throughout the day, how the parameters should distribute, and what normal behavior for that looks like.

Ariel:
The next step would be to correlate between the templates. So if every time I purchase a product on a website, the first log would be, "User purchased product." The second log would be, "Transferring credit information to Stripe." The third log would be, "Sending successful notification to the customer." Now, Coralogix will identify that these three logs usually arrive together at a certain timeframe, at a certain ratio, at a certain order, and notify me once it's breaking. So whenever someone didn't complete his transaction for any reason and it happened enough times within a certain period of time, we identify this is a broken flow and automatically notify our customer.
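
The broken-flow idea can be sketched as follows, under some simplifying assumptions that are not in the episode: the expected flow is hard-coded rather than learned, each event carries a transaction id to group on, timestamps are plain seconds, and the 30-second window and the alert threshold of two broken flows are arbitrary.

```python
from collections import defaultdict

EXPECTED_FLOW = [
    "User purchased product",
    "Transferring credit information to Stripe",
    "Sending successful notification to the customer",
]


def broken_flows(events, window_seconds=30):
    """Return ids of transactions that did not complete the expected flow.

    Each event is a (timestamp_seconds, transaction_id, message) tuple.
    """
    by_transaction = defaultdict(list)
    for timestamp, transaction_id, message in events:
        by_transaction[transaction_id].append((timestamp, message))

    broken = []
    for transaction_id, entries in by_transaction.items():
        entries.sort()  # order by timestamp
        messages = [message for _, message in entries]
        duration = entries[-1][0] - entries[0][0]
        # Broken if a step is missing, out of order, or the flow is too slow.
        if messages != EXPECTED_FLOW or duration > window_seconds:
            broken.append(transaction_id)
    return broken


def should_alert(events, min_broken=2, window_seconds=30):
    """Alert only when the flow broke enough times in the given events."""
    return len(broken_flows(events, window_seconds)) >= min_broken


events = [
    (0, "t1", EXPECTED_FLOW[0]),
    (1, "t1", EXPECTED_FLOW[1]),
    (2, "t1", EXPECTED_FLOW[2]),
    (5, "t2", EXPECTED_FLOW[0]),
    (6, "t2", EXPECTED_FLOW[1]),  # notification never logged
    (9, "t3", EXPECTED_FLOW[0]),  # stopped after the purchase
]

print(broken_flows(events))
print(should_alert(events))
```

Requiring several broken flows before alerting mirrors the point in the conversation that a flow has to break enough times within a period before it is reported.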

Corey:
Automatically? Sounds like a bit of AI in there. You're not always asking the developer to specify what an anomaly is. You're guessing based on the patterns that you see throughout the logs?

Ariel:
Yeah. It's something that we are very consistent with. We are not trying to educate the market. We understand, like this podcast, saying, "You need to be proactive. You need to do this. You need to do that." But people are very smart. We're not smarter than anyone else. They're just busy with other stuff. So we're saying, if you want people to be proactive with logs, just do that for them. So we automatically learn the patterns of the templates. We automatically learn the behavior, automatically notify them of new errors, or suspected errors, or top errors, and then automatically figure out the flows, automatically notify when a flow is breaking.

Ariel:
Then, the third step is that we learn all the trends for all the different applications and services. We understand the ratio of errors, the ratio of bad API responses, the ratio of exceptions per hour, per day, per application, per service, and then we notify whenever there's a spike. So if there's suddenly more errors than we expect for a certain service at 9:00 PM or on a Sunday morning, we'll notify automatically. We also send a daily report of the top errors and suspected errors. So we basically interact with you so you get insights by push. You don't need to come in and pull them from the system.
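
One simple way to picture the spike detection described here is a per-service, per-time-slot baseline with a standard-deviation threshold. The sketch below is an assumption about how such a check could work, not a description of Coralogix's models; the three-sigma rule, the minimum sample count, and the sample error ratios are all made up for illustration.

```python
from statistics import mean, pstdev


def is_spike(history, current, sigmas=3.0, min_samples=4):
    """Flag an error ratio well above what this time slot usually shows.

    history: past error ratios for the same service and time slot.
    current: the error ratio observed now.
    """
    if len(history) < min_samples:
        return False  # not enough data to have learned a baseline yet
    threshold = mean(history) + sigmas * pstdev(history)
    return current > threshold


# Hypothetical baselines, keyed by (service, weekday, hour).
baselines = {
    ("checkout", "Sun", 9): [0.010, 0.012, 0.011, 0.009],
    ("checkout", "Mon", 21): [0.030, 0.028, 0.035, 0.031],
}

print(is_spike(baselines[("checkout", "Sun", 9)], 0.080))
print(is_spike(baselines[("checkout", "Sun", 9)], 0.012))
```

Keying the baseline by service and time slot is what lets the same error ratio count as normal on a busy weekday evening and as a spike on a quiet Sunday morning.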

Ariel:
The last part is we integrate with the CI/CD pipeline. For instance, Heroku pipelines, or Jenkins, or CircleCI, and then we use all the knowledge that we have on the standard behavior of your system on the baseline to benchmark every change you do to production. So we compare the two versions, the current and the previous one, and tell you something like, "This new version created this new error, broke this flow, generated this alert," and so on and so forth. So you get a lot more confidence doing CI/CD. Again, that affects your actual business. That means you're faster to market. You fix things faster. You're a better company that is more engaged with creation and not maintenance. So this is basically what we're about in terms of anomalies and machine learning.
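
At its simplest, benchmarking a release against the previous one is a comparison of which error templates appear under each version. The sketch below shows only that core idea; the function name, the version labels, and the sample error messages are hypothetical, and a real comparison would also cover broken flows and triggered alerts.

```python
def benchmark_release(previous_errors, current_errors):
    """Compare the error templates seen under two versions of an app."""
    previous, current = set(previous_errors), set(current_errors)
    return {
        "new": sorted(current - previous),       # introduced by the release
        "resolved": sorted(previous - current),  # no longer seen
    }


# Hypothetical error templates observed under two consecutive versions.
errors_v41 = ["Timeout calling inventory service", "Cart not found"]
errors_v42 = ["Timeout calling inventory service", "Null price in checkout"]

report = benchmark_release(errors_v41, errors_v42)
print(report)
```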

Corey:
So for clients who might be starting out, maybe they're moving from another logging tool. Maybe they didn't have a logging tool before. Maybe it's a brand new application. They're not even sure what their metrics are yet, and they want to start logging out right. What would you recommend that they do first to get a solid logging infrastructure in place from the start?

Ariel:
So I say three things that you need to set in order to be successful with logs. One is a unified log format. The organization needs to write, preferably, in JSON format. They need to understand the key types, and there needs to be some kind of middleware platform that standardizes the logs that all the different services send and only then sends them to a service. That's one. Number two, standardized severity. Different people in the organization classify logs with different severities based on their own experience or their own ideas, or how stressed they are as people. But if you standardize severity, anyone who looks at an error or a critical log understands exactly what it means for production. Number three is to set up the alerting policy. What types of alerts do we create? Do we push an alert to Slack, or email, or phone, or not? And build the infrastructure to receive those alerts: the proper Slack channels, the proper convention for creating alerts and their descriptions, and so on and so forth.

Ariel:
Once you have that, you have the right format of data streaming into the system. You know the severity of the data, and then you can create precise alerts whenever a severity of data is exceeding what you want. Then, you send them properly to the right people in the right Slack channel, and they obviously have standardized data that they can investigate. This is your first step to identify things faster, and then solve them faster. Obviously, there's a lot more to do. But if you complete these three steps, you should be at a very good point to start this journey.
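
The three steps can be sketched with Python's standard `logging` module: a formatter that emits every record as JSON with the same keys, the module's built-in level names as the shared severity scale, and a small table mapping severity to destinations. The key names, the `ALERT_POLICY` table, and the channel names are assumptions for illustration; nothing here is sent to Slack, email, or a phone.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Step one: every service emits the same JSON keys."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "severity": record.levelname,  # step two: one shared scale
            "service": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


# Step three: an agreed policy for where each severity is sent.
# The destinations are placeholders, not real integrations.
ALERT_POLICY = {
    "CRITICAL": ["phone", "slack:#incidents"],
    "ERROR": ["slack:#alerts"],
    "WARNING": ["email"],
}


def route(severity):
    """Return the destinations for a severity (none for INFO and DEBUG)."""
    return ALERT_POLICY.get(severity, [])


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment provider timed out")
print(route("ERROR"))
```

Because every record carries the same `severity` key, an alert rule can be written once against that field instead of being tailored to each service's message format.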

Ariel:
Also, what I say is start with something. Send the data. Our system automatically learns it, generates the insights, clusters the data, and then you can start improving from there. But any system that you send the data to and that starts giving you some level of observability will lead you to improve. So with logs being such an important part of your infrastructure and monitoring, I guess any engineer who just has logs put in front of them, with good observability into them, will constantly improve. Before you know it, you have a great tool that you can rely on.

Corey:
I imagine you have visions for the future as well that go beyond even what you've said here. Could you share a bit about where you see logging going from here?

Ariel:
Particularly for us as a company, we're going to expand. We already have a metric solution and a security solution, soon going into tracing. So the idea for us would be to combine everything and allow you to understand the connection. We believe that logs will be the leading indicator, and then we drag in other information from the security component, from the metrics, from the tracing, and give you a full report of an incident. So you can actually understand that there's something happening and solve it very fast.

Ariel:
Specifically, in the world of logs, we're continuing to press on price. We're continuing to improve the technology and change the paradigm of a standard price per gig. What's happening today in the market is that any product tells you… They have a price per gigabyte that you ingest, and there's competition obviously. Some are cheaper. Some are less cheap. But the problem is that you're paying the same amount for a log, whether it's the most important log in your production or the least important log in your production.

Ariel:
So think about you walking in a supermarket. You fill up the bag, and as you step out, someone weighs the bag. He doesn't care what's inside and tells you, "Five pounds. Five pounds costs $6." We're going in a new direction where you basically pay according to the importance or the value that a specific log provides to you, and the price per gig varies depending on how important the data is for you. We're not only talking about retention periods, or hot or cold storage. We're talking about data pipelines on demand, different pipelines that will transform and analyze data based on your needs, and your needs will determine the price per gig. That way, we're getting closer to the cloud world of on-demand, pay as you go: basically pay for what you need and not commit to packages.

Corey:
Well, this has been really enlightening. I've learned that logs are really at the root of a lot of things. It's just really interesting, and I appreciate you sharing your insight about where logs have already gone, which is I think to a much more useful, interesting space and where you see them moving in the future. So Ariel, thank you so much for being on Code[ish].

Ariel:
Thank you very much, Corey.

About Code[ish]

A podcast brought to you by the developer advocate team at Heroku, exploring code, technology, tools, tips, and the life of the developer.

Hosted By:

Corey Martin

Customer Solutions Architect, Heroku

with Guest:

Ariel Assaraf

CEO and founder, Coralogix
@ArielAssaraf