Talking Traces and OpenTelemetry

Deeply Technical
October 22nd, 2025
17:42

Hosted by Jon Dodson, Alex Arnell

Jon Dodson has an 11-year Heroku veteran with him on the podcast this week, Principal Member of Technical Staff Alex Arnell. Together they talk through the native integration of OpenTelemetry in Heroku Fir, the benefits of traces over traditional logs, how they assist debugging, and what’s next for observability in modern development.

Show Notes

Narrator
Hello and welcome to Code[ish], an exploration of the lives of modern developers. Join us as we dive into topics like languages and frameworks, data and event-driven architectures, artificial intelligence, and individual and team productivity. Tailored to developers and engineering leaders, this episode is part of our Deeply Technical series.

Jon
Hello, my name is Jon Dodson and I’m an engineer working for Heroku. I’m a big Heroku super fan, and I’m excited to talk to you all about what awesome stuff is happening at Heroku. And today to talk about what’s awesome at Heroku, I’m joined by Heroku’s own Alex Arnell. Hello, Alex.

Alex
Hello, Jon.

Jon
All right Alex, let’s start things off. So, let’s start out by telling us a bit about yourself and what you do for Heroku.

Alex
Sure thing. So, I’ve been at Heroku for coming up 11 years, actually just crossed the 11-year mark.

Jon
Oh, congratulations.

Alex
Oh, thank you. Yeah, I’ve come full circle in my career at Heroku. So, I first joined Heroku to join the Heroku telemetry team. And that team existed for a few years and then fizzled out, and I bounced around a few other teams at Heroku, always kind of focusing on the backend side of things. And then, yeah, recently in the last year, I managed to get a telemetry team reformed. This time we’re focusing a lot on OpenTelemetry, of course.

Jon
Awesome. So, as you’ve said, you’ve been working at Heroku for about ten years now, 11 now. And I wonder what your most memorable time was delivering something here. Any exciting highlights from your time?

Alex
Oh yes. Obviously, I think for me, the most memorable thing that we’ve delivered is one that actually just came out a few months back, which is the Fir platform. I had a big hand in that effort. And it’s the first time we have a platform-native OpenTelemetry offering at Heroku.

Jon
Yeah.

Alex
So that was the culmination of a year and a half worth of work and many more years of pushing and prodding the right folks to get that project on the table and out the door. So definitely was exciting. And there’s a lot of buzz, I think, in the community around that release, so.

Jon
Yeah, I agree there really was and I think a lot of people are really excited internally to get that out and continue working on that. In your recent talk, Lessons Learned Adopting OpenTelemetry at Scale at a conference for the Cloud Native Computing Foundation, you said you think of scale not in terms of scaling up systems or in terms of users, but you look at it in terms of years of code and code repositories in Heroku. I wonder why looking at years of code is a helpful scale metric when thinking about Heroku projects?

Alex
Oh yes, there’s a lot of context, right? With… if you’re embarking on a project or a journey of adding OpenTelemetry or really even adopting anything new across an entire engineering org, with Heroku, it… it’s a large engineering team that’s been around for… since 2007, so a number of years with different repositories over that time. And that statement is really thinking about what is the effort that’s going to be involved in getting all of these teams on board with the idea? And kind of that’s where I was going with that in the talk, because it’s more than just, I’m just going to embed the SDK into my project and then that’s done. If you’re trying to get that across the board, then you got to get many teams involved and you got to think about the full scale of that.

Jon
Yep. That makes a lot of sense. So, there’s a lot happening in the tech industry for good or, you know, maybe sometimes not for good, according to some people. So, for you, what’s one thing you would have the industry focus on for the good of everybody?

Alex
Oh, that’s a good one. And I think you would probably agree with me like AI is currently the big thing, right?

Jon
It’s huge. So big.

Alex
Yeah, yeah. So big. And it’s moving really fast. You know, so much stuff that’s coming out. I’m obviously going to be a little bit geared toward the telemetry and the observability side of things, and I think that’s an area that we need to focus on. There are others in the industry that have similar thoughts, and they’re spending a lot of time thinking about that and discussing it. I believe, like a lot of the AI SIGs are probably some of the most popular and most attended these days.

Jon
Right, they are.

Alex
The OpenTelemetry SIGs. With MCP especially, and you have like agents calling out to other agents or agents that are generating self-generating code to call into other agents and doing crazy AI things, like being able to trace all that and know where things are falling apart and how long things are taking is going to be interesting.

Jon
Yeah, I agree with you. I think OpenTelemetry is a really good thing to focus on when there are so many moving components in the world of AI, where agents are talking to agents are talking to AIs are talking to agents and all that.

Alex
Yeah, there’s so many remote calls that are happening and…

Jon
Yeah, yeah.

Alex
…processes being spun up in the background. Yeah, there’s a ton of things. And I feel like connecting all those dots and getting that telemetry, especially when you think about calling out to third party things, because those third party things are obviously going to be run by third parties, so how do you… like can you connect all those dots together, or do you end up with a bit of a black box that just says it took 15 seconds to get a response back from that third party? Yeah, it’s an interesting one.

Jon
Well, if you want your platform to be performant, I think it’s really important to understand the time frame of all of that, so I think that’s a really good point. Speaking of OpenTelemetry, I know we started talking about it a little bit already, but some people listening to this episode might be hesitant to make the jump to OpenTelemetry. So, lots of companies get what they need through logging, logging dashboards and have a huge investment in that. So why OpenTelemetry? Why not just keep doing the logging thing?

Alex
Oh, that’s a good question. In some ways, we at Heroku are an example of one such company.

Jon
We have a lot of logs.

Alex
We have a lot of logs. One of the real key things, there’s a lot to OpenTelemetry with the different signals, especially now that you have profiles available to you. The shift from logging, using logs to using traces is a big one. Being able to… like the power of the trace, of being able to carry that context around from service to service is really where it shines. There’s a lot to get wrong in that as well. We’ll probably talk a bit about that later on, but that’s where things start to shine. It’s like being able to trace down and see where the problems occur and connect the dots, because if you’re an executive, especially, and you’re looking at a holistic view of where things are failing in an incident, even if you’re not an exec, if you’re an engineer, you’re trying to debug what’s going on in the middle of an incident, it’s nice to be able to see a waterfall.

Jon
Yeah.

Alex
And you can see where something might be broken, or you can start to detect like, oh, this is taking a lot longer at this particular spot or this call out to a service. That’s definitely one of the things that I think is well worth the effort.
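
[Editor’s sketch] To make "carrying that context around from service to service" concrete, here is a minimal Python illustration of the W3C Trace Context traceparent header, which is the default propagation format in OpenTelemetry. This is not Heroku’s implementation, and in practice the OpenTelemetry SDK does this for you. The function name outgoing_headers and the assumption that header keys are already lowercase are hypothetical choices made for the example.

```python
import re
import secrets

# traceparent layout: version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def outgoing_headers(incoming: dict) -> dict:
    """Build the traceparent header for a downstream call.

    If the incoming request carries a valid traceparent, keep its trace-id so
    both services land in the same trace; otherwise start a new trace.
    Assumes header keys are lowercase (an assumption of this sketch).
    """
    match = TRACEPARENT_RE.match(incoming.get("traceparent", ""))
    valid = (
        match is not None
        and match.group(1) != "0" * 32  # an all-zero trace-id is invalid
        and match.group(2) != "0" * 16  # an all-zero parent-id is invalid
    )
    if valid:
        trace_id, flags = match.group(1), match.group(3)
    else:
        trace_id, flags = secrets.token_hex(16), "01"  # new trace, sampled

    span_id = secrets.token_hex(8)  # this service's own span becomes the parent
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


if __name__ == "__main__":
    upstream = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
    print(outgoing_headers(upstream))
```

Because every hop reuses the trace-id and only swaps in its own span id, a backend can stitch the spans into the waterfall view described above.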

Jon
Yeah, I agree with you on that. It’s already started. I’ve already started thinking about things in terms of traces more than doing something like in New Relic, for instance, or whatever, because the OpenTelemetry traces are just as good in my opinion, in some instances, so.

Alex
Yeah. Well, the other good point about OpenTelemetry, it’s vendor neutral.

Jon
Oh yeah. Yeah that’s true.

Alex
It’s a big deal to be able to just send your data to any vendor that supports OpenTelemetry. So, like being able to… you’re not kind of just like stuck on one vendor and their particular way of doing things. You can follow the OpenTelemetry semantic conventions, which means shifting from one vendor to another is probably going to be a bit easier as well. One vendor can implement a generic dashboard implementation, and maybe that’s good enough. But if you decide to switch vendors down the road, you can do that. And they likely have a dashboard that’s built on the semantic conventions as well. And so you gain the same sorts of information.
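
[Editor’s sketch] One way to picture the vendor-neutral point: the OpenTelemetry specification defines standard environment variables such as OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_HEADERS, and OTEL_SERVICE_NAME, so switching vendors can come down to changing configuration rather than code. The Python below only parses those variables to show the idea. The helper name otlp_settings, the fallback values it uses, and the example.com hosts are assumptions for illustration, not any vendor’s real endpoint.

```python
def otlp_settings(env: dict) -> dict:
    """Read exporter settings from the standard OpenTelemetry variables.

    OTEL_EXPORTER_OTLP_HEADERS is a comma-separated list of key=value pairs.
    The fallback values below are assumptions of this sketch.
    """
    headers = {}
    for pair in env.get("OTEL_EXPORTER_OTLP_HEADERS", "").split(","):
        if "=" in pair:
            key, value = pair.split("=", 1)
            headers[key.strip()] = value.strip()
    return {
        "endpoint": env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318"),
        "service_name": env.get("OTEL_SERVICE_NAME", "unknown_service"),
        "headers": headers,
    }


if __name__ == "__main__":
    # Hypothetical vendors: only the environment differs, the code does not.
    vendor_a = {
        "OTEL_SERVICE_NAME": "checkout",
        "OTEL_EXPORTER_OTLP_ENDPOINT": "https://otlp.vendor-a.example.com",
        "OTEL_EXPORTER_OTLP_HEADERS": "x-api-key=aaa",
    }
    vendor_b = dict(vendor_a, **{
        "OTEL_EXPORTER_OTLP_ENDPOINT": "https://otlp.vendor-b.example.com",
        "OTEL_EXPORTER_OTLP_HEADERS": "authorization=Bearer bbb,x-team=web",
    })
    print(otlp_settings(vendor_a))
    print(otlp_settings(vendor_b))
```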

Jon
Right. So, what was one of the biggest challenges you had in rolling out OpenTelemetry in your Heroku projects? Any tips for others looking to do the same?

Alex
One of the big challenges we still face today is consistency.

Jon
Right.

Alex
And what I mean by that is there’s a certain set of attributes that you would typically put on your telemetry data, like service name. Being able to be consistent with those things, and I think the semantic conventions help a lot with that, but having a consistent way of framing how the services are named goes a long way. And also, when you’re digging into a trace, if there is a consistent way in which the function calls or the outbound requests are named, it helps a lot. Common sets of attributes. It just makes it possible… I mean especially within Heroku, right, we have many different teams. And in a big incident it’s oftentimes very useful to have several engineers, especially if the incident is spanning multiple teams, being able to go into your observability backend and just start querying away and being able to grok and understand the telemetry data that’s there, because it’s all got a consistent look and feel. And I think that’s a huge challenge. It’s very easy for teams to make up their own things. Subtle differences, like using a hyphen versus an underscore in a name. Yeah, that’s where things get tricky.
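
[Editor’s sketch] The hyphen-versus-underscore problem is easy to check for mechanically. The Python below groups service.name values (service.name is the OpenTelemetry semantic-convention attribute) that differ only by case or separator. The lowercase-with-hyphens rule and the function names are hypothetical; they are one possible convention, not one Heroku is known to use.

```python
import re
from collections import defaultdict


def canonical(name: str) -> str:
    """Normalize a service name to lowercase words joined by hyphens."""
    return re.sub(r"[-_\s]+", "-", name.strip().lower())


def find_inconsistent(names) -> dict:
    """Return canonical names that appear under more than one spelling."""
    groups = defaultdict(set)
    for name in names:
        groups[canonical(name)].add(name)
    return {key: sorted(spellings) for key, spellings in groups.items() if len(spellings) > 1}


if __name__ == "__main__":
    seen = ["billing-api", "billing_api", "router", "Billing-API"]
    print(find_inconsistent(seen))
```

A check like this could run against the service names seen in a telemetry backend, or in CI against each repository’s configuration, so that drift between teams is caught before an incident rather than during one.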

Jon
Yeah, I agree with you on that. Consistency is really important. And discoverability. So, in your opinion, what’s the future of observability and monitoring in software development, and how does OpenTelemetry play a role in the future?

Alex
I think that, you know, observability is going to be a big, big, big factor and have a big, big future in software. Especially AI. Maybe it’s going to be a bit trendy, but I think it’s here for good. So, we’re going to have software written by AI, and as an engineer, being able to understand and track what’s going on in that software is going to be crucial when it fails.

Jon
Absolutely.

Alex
So, AI could probably get you only so far in helping to debug that, but you’re going to need a human to actually connect the dots in the end. So, I feel like observability and monitoring, it’s only going to grow. And I think that it’s going to become more and more important, especially as like your little sort of vibe coded side project becomes big and you’re making a ton of money off of it, and you’ve got customers who are depending on it, and then suddenly it goes down. What do you do? So yeah, I think it always is going to come back, and trust is a big factor there.

Jon
Yeah, it’s huge. So, in 2026 we’re going to see a new huge Marvel movie release in the film Avengers: Doomsday. Lots of fan theories are spinning around on the internet. And just to make everyone feel really comfortable here, Alex and I have no inside information and we don’t know of any actual film spoilers, so there’s nothing really to be concerned with that here. That said, one still open thread is the whole Kang Dynasty plot that was set up in Loki and then Quantumania and Loki 2, I don’t think I missed anything, I think that’s all the movies he was in. And so, my question for you, Alex, is this: will Marvel close the loop here or shuffle on to Doomsday as if nothing ever happened with Kang? Would you want to see some explanation here or just moving on? Is that fine?

Alex
Oh, there’s the curveball question. I’ll admit I’m not much of a Marvel person.

Jon
Okay.

Alex
But I have watched the Loki series.

Jon
Okay, okay.

Alex
I am familiar with the topic. That is the kind of question I’d want to go and talk to my son about, but I’ll wing it.

Jon
Sounds good.

Alex
It is a big open-ended plot twist there with Kang. There could be countless versions of him. And yeah, the threat was there that he’ll come back and come back with a vengeance. Multiple copies.

Jon
Yeah. So, in Ant-Man: Quantumania, there was a bunch of Kangs at the end. I mean, I think one way you could solve this fairly quickly is if you had a scene where, like, Doctor Doom walks in and he could just say, like, I conquered the council or something and he twirls his mustache and laughs maniacally at the camera and it fades out. Or you can even have like a camera pan over a multiverse dog or cat Kang in a dog kennel doing like a sad face. Anyway, you could do it quickly I’m just saying.

Alex
It’s possible.

Jon
I’m really curious if they’re even going to do anything at all. They might not. They just might, just like, move on entirely and say nothing about it. I’m really curious how that’s going to work. I guess we’ll know in a couple years.

Alex
Yeah, I think you’re right, though. I think they’re probably going to just move on a little bit. But they’ve definitely left themselves with a big story that they could come back to.

Jon
I agree. All right, so recently you posted an article to the Heroku blog titled OpenTelemetry Basics on Fir. And if folks are looking for a good intro to OpenTelemetry, I recommend they read it, as it takes you on a journey all the way through setting up a Go app with a Grafana Cloud instance for metrics. And it also shows the power of what you can do with metrics and collect for free on our new Fir platform. My question for you, Alex, is this: at what point in the Fir process did you realize that Fir can get our customers’ metrics on their apps, such as CPU usage and memory usage, for free, like for everybody?

Alex
Well, that was right from the beginning. That was part of the vision of that whole product. So, I was researching what we could do and thinking about the telemetry product idea. One of the big gaps that I had kind of seen across the industry is that no one had really built it natively into their platform. Yes, you could use it in their platform. And maybe they had offerings where like it tied pieces together, but you still had to glue it all together and you still didn’t really gain insights into the platform itself. And that was one of the things that I wanted to make available. I think we talked about it before, trust is a big key. Folks want to understand what’s going on. And so right from the beginning, I wanted to make sure that we could bake certain things into the platform so that as a customer, you don’t even think about it. It’s just there for you.

Jon
Yeah. Awesome.

Alex
It’s more than just CPU and memory usage, too. There’s the trace, starting a trace from the router. So, the goal was to get enough telemetry data into Fir so that as a customer, without having done anything, you would have enough data to diagnose a problem. And that was kind of like a key mission.

Jon
Yeah. It’s incredible. That’s really, really cool about that. So finally, as we’re wrapping up here, thank you for talking to me today. What’s next for OpenTelemetry at Heroku? Anything coming that you’re really excited about?

Alex
Oh, there’s a lot of things that I’m really excited about. I probably can’t talk about all of them.

Jon
Oh, okay.

Alex
But one of the things I’m really excited to get out the door and it’s going to be probably a bit of a game-changer in terms of helping to get set up…

Jon
Oh?

Alex
…is third party integration with our add-on marketplace.

Jon
Oh, okay.

Alex
So right now, you kind of have to read through the documentation if you’re using Fir. You need to read through our docs. If you’re using a vendor like Grafana or Honeycomb, you’ll need to read through their docs and put the two pieces together and figure out how to configure your telemetry drain manually. But we’re working on doing this through the marketplace. This would be a single one-liner. It changes it from having to understand and manage tokens and auth credentials and headers and things like that, to just a simple heroku addons:create for a Grafana plan. And then all of the magic happens on the back end to set up that telemetry drain.

Jon
Awesome. Oh, that sounds great. I’m looking forward to that. And I think that’s all the questions I had today, Alex. Thank you so much for talking to me and talk to you soon.

Alex
Yep. Talk to you soon. Thanks for having me.

Narrator
Thanks for joining us for this episode of the Code[ish] podcast. Code[ish] is produced by Heroku. The easiest way to deploy, manage, and scale your applications in the cloud. If you’d like to learn more about Code[ish] or any of Heroku’s podcasts, please visit heroku.com/podcasts.

About Code[ish]

A podcast brought to you by the developer advocate team at Heroku, exploring code, technology, tools, tips, and the life of the developer.

Hosted By:

Jon Dodson

Software Engineering LMTS, Heroku

with Guest:

Alex Arnell

Software Engineering PMTS, Heroku