

OTel in the Age of AI

TAGS

  • Tools and Tips
  • AI
  • OpenTelemetry


Austin Parker, Director of Open Source at Honeycomb and main contributor for the OpenTelemetry project, joins Julián Duque to discuss how observability is evolving in the age of generative AI. From OTel’s origins to the development of new semantic conventions for LLMs, Austin explores how standardized telemetry provides the visibility needed to keep complex, AI-powered applications running smoothly.


Show Notes

Julián
Hello hello and welcome to Code[ish], The Heroku Podcast. My name is Julián Duque, Principal Developer Advocate for Heroku and your host here at the Code[ish] Podcast. And with me today we have Austin Parker. He’s the Director of Open Source at Honeycomb and main contributor for the OpenTelemetry project. Hello Austin. How are you doing?

Austin
I am doing great. Thanks for having me.

Julián
Thanks for joining us here at the Code[ish] Podcast. We are fans of OpenTelemetry. We recently added support for OpenTelemetry on our next generation of Heroku, our Fir runtime. And that has been an opportunity for me to start learning a little bit more about OTel. I'm new to the topic, so I'm eager to learn a lot, and I'm also looking forward to everyone in our audience who doesn't know anything about OpenTelemetry learning a bit more about it, and, with this new age of AI, what the role of OpenTelemetry is with AI. So let's start from the very beginning. You were mentioning before we started recording that you were kind of one of the founders of the OpenTelemetry project. Tell me a little bit more about how that started.

Austin
Sure. You know, I think that with all really successful open-source projects, success has a thousand mothers or whatever, right? But OpenTelemetry was formed, I want to say, 6 or 7 years ago, by merging the existing OpenTracing and OpenCensus projects together. And so that brought together a lot of observability engineers and people working on this problem from companies like Google and Microsoft and Lightstep and open-source projects like Jaeger and Prometheus. And we brought all this together so that we could kind of address this problem in the space.

Everyone needs observability, right? It doesn’t matter what kind of app you’re running. It doesn’t matter, you know, how many users you have. If you don’t know what’s happening in production, you don’t know, you know, how to improve your application. You don’t know how to fix it when it breaks. You don’t know how to make user experiences better. So everyone needs this data. But all of that data was kind of trapped in proprietary ecosystems. You would go pick a vendor or you’d have to kind of roll your own thing. And it wasn’t interoperable, right? If you wanted to change something, if you wanted to switch who you were using to analyze that data, then you have to start over, throw everything out and go from scratch, or you have to spend a lot of time maintaining your own system of telemetry, right? And when we say telemetry, you know, we’re talking about the data from your app and from your infrastructure that says what it’s doing. So your log messages or your metrics or distributed traces or even things that are, you know, lower like continuous profiles and other ways of really inspecting runtime behavior.

So our idea with OpenTelemetry is what if there was a standard? What if there was one way to do this and everyone could kind of agree on it, and all the vendors could agree on it, and then open-source maintainers of other projects or other frameworks could integrate that? And we turned telemetry from this sort of, like, day two problem to something that’s a built-in feature of cloud-native software. And that’s what brought these people together. That’s what brought these projects together. And now here we are, you know, six years later and it’s become, you know, ubiquitous, right? OpenTelemetry is the second biggest project in the CNCF. We’re seeing it being integrated into more and more, both cloud-native software, but also into things like application runtimes. We’re seeing it appear in utilities and tools around AI. We’re seeing it appear in platforms like Heroku, obviously, right? And I think it’s a real validation of that initial mission that we need, that people need this data and it needs to be built in. And so it’s super exciting to see how far it’s come in that time. And I’m really honored to be a part of this project and to help it mature and grow as part of the governance committee.

Julián
That's beautiful. Tell me a little bit more about the governance side of the open-source project: how that works, how decisions are made. Because it is a huge open-source project, backed by different companies, how can you get all these people, with different interests, to agree on something?

Austin
Yeah. I won’t say it’s necessarily easy. It takes a lot of patience. But what I think is really interesting about OTel, and this is something that we started with. We started as multi-stakeholder, right? Like it wasn’t just, oh, these people over at Lightstep have a good idea and they’re going to bring other people in. You know, we started by getting all these different companies that kind of had an interest here, and had prior art of not just selling, you know, observability platforms, right? Like, Google and Microsoft were coming into this project not because, oh, they, you know, they have some big investment or stake in it, but because they have people that have solved these problems for planet-scale software. And I was joking with someone about this last night, but it’s almost like Kubernetes, where there’s a lot of things in Kubernetes that feel annoying or difficult or weird and nobody really ever thanks you for solving problems that they will never have because of this.

Julián
Yeah, yeah.

Austin
And OTel is very similar in a lot of ways, right? Like we brought together all this expertise originally that was focused on how do we kind of do this correctly, how do we solve all of these problems that maybe some of them only show up when you get to, you know, a million requests a second? How can we build a system that is scalable from, you know, very small, you know, hobby projects all the way up to, again, planet-scale software, and I think that focus has been really good for making a pretty resilient framework.

Now, the flip side of that, and I think where we get into the interesting governance stuff is it’s very challenging to prioritize with that because there are a lot of stakeholders that operate observability platforms. And they want, you know, it’s like, oh, we want to make it easier to use for customers. We want one-click install. We want dashboards out of the box and all this. And so we have to balance kind of those requirements and those asks from a governance perspective to help focus the project on our 2 or 3 groups of users. And one of those is definitely our integrators. So, people that are integrating OpenTelemetry into their cloud-native software or framework. The second are sort of like systems operators, people building platform teams and kind of building telemetry platforms for enterprise at scale. And the third is definitely that sort of application developer, that end user, who needs to add in tracing to, you know, an existing e-commerce app or something, right? Like we need to balance all three of those.

And I think the big challenge in governance for us always is figuring out, okay, what’s the ratios there, right? We can’t do everything for all three of them all the time. We have to kind of prioritize things and those are the conversations that make governance challenging but also fun, right? It’s not just a clear golden path to the right answer. You have to go and chat with a lot of people. One of the reasons I love, you know, KubeCon and I love being able to come to these events in person is to meet our users in person and say like, hey, what’s working? What’s not working? How can we do better? I love being able to get that feedback directly and it’s so helpful in shaping where the project, you know, is going in the short term and long term.

Julián
Nice. And how big is the governance body and the project in general in terms of contributors or also the sponsoring companies behind the project?

Austin
Yeah, so for governance, we have a kind of a federated model where there is a central elected governance committee of nine people. There’s a technical committee that sort of handles a lot of the high-level direction of the specification. I’m not sure if there’s a great analogy for it in other projects. But you can sort of think of the GC as the product owners, as it were. And the TC is sort of the architects.

Julián
Yep.

Austin
But then we also leave a lot to sort of the SIGs, right? So the way, you know, the flow is supposed to work is we have a specification, we write things in the specification, and then the Java SIG will take that and implement the spec for their language. And the Ruby SIG will implement the spec for their language. And so in those language SIGs, you know, we have a contributor ladder, so you go from, you know, contributor to triager, approver, maintainer. And when you're a maintainer for a SIG, some people want to stay there. They want to focus on, you know, just their language or just this technology. Some of them want to kind of graduate up and move more into these project-wide things, working on the spec. I don't think we quite have 100 maintainers across the SIGs, but we're getting there. Over the last 12 months, we have about 1500 kind of unique contributors across, I think, around 300 companies. I'd have to go check my math. And about 50% of those contributing companies are actually end users as well, right?

So it’s not just the vendors coming in or the big, you know, the big people coming in and doing all this work, we do get a lot of contributions from our end user community, and integrate those in and go from there. So, that’s been pretty exciting to see that growth. You know, over the course of the project, it’s been a pretty linear growth in overall contributors, which is exciting and honestly a little scary. We’d expected it at some point to plateau, but it really hasn’t. Every time we do a KubeCon and we look at the graph, it just keeps going up and to the right. So, that’s pretty exciting.

Julián
That's wonderful to see: the growth, the adoption, the evolution of the project. You have been there since the beginning, right? Or did you join a bit after it started?

Austin
So, I was an OpenTracing maintainer.

Julián
Okay.

Austin
And there were about 20 or 30 people involved. The hidden story, I want to say, is that in late 2018 there were some internal conversations between the OpenTracing and OpenCensus maintainers about bringing the projects together. And then we reached out to the CNCF to get sort of an independent, third-party mediator in these meetings, and that led to, you know, sort of a two-phase approach where we had one group of people over here doing the governance chartering and figuring out how the project was going to work. And then we had another group over here doing sort of the technical evaluation of, okay, how can we bridge these?

Because OpenTracing and OpenCensus were similar in their goals, but the way that we actually got there was very different. So we had to figure out, what is sort of the technical design of it? And I started out a little bit on that side and on sort of the community side, doing things like documentation and our website and stuff like that. And then over the years, I contributed to several SIGs, including our .NET SIG, our demo project, a few other things here and there. Starting about four years ago, I kind of got into a community management role, helping to do a lot of events and evangelism. Before my current role, I was actually working in developer relations.

Julián
Oh, nice!

Austin
So I was a principal developer relations advocate at Lightstep and then ran a little team there, and so did that for a while. And two years ago now, I ran for OpenTelemetry Governance Committee and got elected, and so since then, I’ve been focusing on that high-level project governance.

Julián
Beautiful. And since the beginning, since the inception, until today, how has the spec, and the project, evolved? Do you think it has changed a lot? Has it adapted to the needs of the ecosystem? Or has it stayed pretty much stable…

Austin
I think the core has stayed pretty stable. The, you know, one thing we don’t get a ton of chances to talk about is, you know, the design of OpenTelemetry and what we’re really trying to encourage, because it’s very interesting in the observability space is that you can go and you can set… you can have all these great ideas about how things should work. And you can encode those into a specification like we’ve done with OpenTelemetry. But, until end users kind of get tools that work with those same semantics, it’s hard to really get people to grasp the vision. And one of the things that OpenTelemetry does that’s pretty unique is that we’ve always been tracing first, right?

We don’t say, oh, you don’t need metrics, oh, you don’t need like … we say you need all of these things. But instead of thinking of it as three independent pillars, it’s a single inter-correlated braid of data. And so you need tracing because you need sort of that context for all of these signals to sort of bind to. And you create traces, you create metrics, you create events, you create all of these things and link them together, and then your analysis tools should be able to be aware of those links and be aware that, hey, this thing is also that thing and that, you know, this time series metric actually has exemplars that I can go in and I can look at specific traces from this point in time, and I can see …

Vice versa I can say, okay, for this specific trace, what metrics are being recorded at the same time, right? And you can really easily pivot between these. And for the most part, you know, tools are catching up, I would say. But there’s sort of a push pull where we’re pushing stuff out into the world through the spec, and then people are kind of pulling that into their products and into their… either their open-source tools or into proprietary platforms. They’re pulling those ideas in. So I think we’ve maintained that core idea, right? What has changed is, if anything, we’ve kind of expanded our scope a little bit.
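To make that pivot concrete, here is a minimal plain-Python sketch (no OTel SDK involved; field names and values are illustrative, not the actual data model) of how an exemplar on a metric point carries a trace ID, letting a tool jump from a point on a time series to a specific trace:

```python
# Illustrative data only: a shared trace ID is what "braids" signals together.
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"  # made-up example ID

span = {"trace_id": trace_id, "name": "GET /checkout", "duration_ms": 182}

metric_point = {
    "name": "http.server.request.duration",
    "value_ms": 182,
    # Exemplars attach sampled trace context to an aggregated point.
    "exemplars": [{"trace_id": trace_id, "value_ms": 182}],
}

def traces_for_point(point):
    """Pivot: from a point on the time series to the traces behind it."""
    return [ex["trace_id"] for ex in point["exemplars"]]

print(traces_for_point(metric_point))  # lists the span's trace ID
```

In a real backend the analysis tool does this lookup for you; the sketch just shows why the correlation is possible at all: both signals carry the same context.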

A good example of this is around, I said events a minute ago. Originally, our idea with logging in OpenTelemetry was just, you know, oh, you already have a logging API, we’ll just give you a way to bridge your logging API into OpenTelemetry, and then we will add in that OpenTelemetry context to your existing … to the logs that are flowing through the API. And as part of our process we were going through and we had people creating these conventions, these semantic conventions for telemetry metadata. And one of the things that we kept hearing was there’s things that we need to represent, that we need to write semantics for that don’t fit into the current model, they’re not really span metadata. They’re not really time series attributes. We need some … something log shaped. And so we, we’ve gone back to the table and said, okay, you know, let’s figure out how to create sort of an events API and a way to define these structured events, these semantic events or structured logs or, you know, we’re calling them events. But it’s the same concept, right? Like, by being very spec-driven, we’re able to kind of go write stuff down, see how that works when people try to implement it, and then we can adjust the spec as we need to, based on that feedback from implementers.
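As a rough sketch of the idea, a structured event is a log-shaped record with a name, structured attributes, and the trace context it occurred in. The field names below are for illustration only, not the exact OTel events data model:

```python
# Illustrative sketch: an "event" is log-shaped (a point-in-time record),
# but structured and linked into the trace context rather than a text blob.
def make_event(name, attributes, trace_id=None, span_id=None):
    return {
        "event.name": name,        # what happened, from a shared vocabulary
        "attributes": attributes,  # structured payload with defined semantics
        "trace_id": trace_id,      # ties the event into the signal "braid"
        "span_id": span_id,
    }

evt = make_event(
    "session.start",
    {"session.id": "abc123", "user.tier": "free"},
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
)
```

The point is the shape: not span metadata, not a time series attribute, but a named, schema-backed record that analysis tools can still correlate with traces.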

Julián
Beautiful. And with the rise of generative AI and all this new world, has the spec adapted to the new needs of AI applications? Or was pretty much everything already there, and it was easy for implementers to start using the spec to create traceability for AI apps?

Austin
Yeah, I think, there’s kind of two parts to this. So one is how suitable was the current spec in the sort of the current semantics to building apps with GenAI? And then the second one is, what are the needs of sort of building apps that run on or are powered by GenAI? And in the first case, what we’ve generally seen, what I’ve seen personally is, because OpenTelemetry has been out there long enough and it has strong documentation and a lot of code examples, you go tell Claude or ChatGPT or whoever, like, hey, add OpenTelemetry to this and it’ll do a pretty good job.

Julián
Yep.

Austin
Right? Just goes and does it. The second one I think is more interesting because there’s sort of two different parts of this. There’s people that are building applications but that are sort of powered by GenAI, right? So chat experiences or things that are writing code for users and running jobs in sandboxes and all of that.

Julián
Agents.

Austin
Agents, right. And so for that the way that we’ve been kind of adapting to that is we have the semantic conventions process that says, well, what is the metadata that you need? Not just what do you do, when you’re writing an agent? What kind of telemetry do you need? You know, you need traces for this. You need metrics for that. But what are the characteristics? Like what are the attributes you care about? You care about, you know, how many tokens are being used. You care about…

Julián
The time to first token.

Austin
Time to first token, right? You care about how many tool calls there are and the way that needs to be represented. Because at the end of the day, that's the sort of stuff that you're going to look at, and you need to have a schema for what you're going to look at. And that's what our generative AI semantic conventions process has been doing. So, we've had a lot of people working on that for the past two years, really. Now we're seeing quite a bit of adoption and uptake of those. We have instrumentation plugins for pretty much every AI API under the sun at this point. And one thing that's been really exciting to me personally is that, since GenAI is sort of the first big new thing that's happened since OTel came around, we're seeing a lot of the people building AI tools just starting with OTel by default.
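As a sketch, span attributes for a single LLM call under those GenAI semantic conventions look something like the following. The attribute keys are paraphrased from the published conventions as I understand them; verify them against the current semconv documentation, since they have evolved:

```python
# Sketch of span attributes for one LLM call, roughly following the OTel
# GenAI semantic conventions (keys approximate; check the current spec).
llm_span_attributes = {
    "gen_ai.operation.name": "chat",
    "gen_ai.request.model": "gpt-4o",      # the model that was requested
    "gen_ai.usage.input_tokens": 1024,     # prompt tokens consumed
    "gen_ai.usage.output_tokens": 256,     # completion tokens produced
}

# The payoff of a shared schema: any backend can compute cost or usage
# views from the same keys, regardless of which provider emitted them.
total_tokens = (
    llm_span_attributes["gen_ai.usage.input_tokens"]
    + llm_span_attributes["gen_ai.usage.output_tokens"]
)
print(total_tokens)  # 1280
```

Because every instrumentation plugin writes the same keys, a dashboard for "tokens per request" works across providers without per-vendor glue.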

So, if you go and you, you know, get a coding agent, you know, like Codex or Claude Code or whatever, those have built-in OTel support, which is really cool. So if you’re, you know, using Claude Code, I do this personally all the time, you can add in the environment variables in your, you know, in your environment to tell Claude Code, hey, send traces about what you’re doing to an OTLP Receiver, to an OpenTelemetry Protocol Receiver, right? So send it to whatever, send it to Jaeger, send it to OpenSearch, send it to Honeycomb, Datadog. Who cares? Like, and then you can see what the agent is doing, right? Like you can have that history and you can see how many tokens am I using and how much is it costing and all that stuff. So, it’s been really great to see that kind of like, oh, people are actually using this and it’s, you know, they’re using it well, I would say.
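The configuration described above looks roughly like this. The `OTEL_*` variables are standard OpenTelemetry SDK configuration; the Claude Code-specific opt-in flag is from memory, so check the tool's own documentation before relying on the exact name:

```python
import os

# Sketch: point an OTel-instrumented tool at any OTLP receiver.
# OTEL_* names are standard OpenTelemetry SDK environment config;
# CLAUDE_CODE_ENABLE_TELEMETRY is the tool-specific opt-in as I recall it.
otel_env = {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4318",  # any receiver
    "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
}
os.environ.update(otel_env)
```

The same three-variable pattern works whether the receiver behind the endpoint is Jaeger, OpenSearch, Honeycomb, Datadog, or a local Collector, which is the whole point of the protocol being standard.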

Julián
Yeah, I was trying one observability platform for AI, and the install options were like, okay, use our connector, or just set up OTel with these steps. And it worked out-of-the-box. So that's great for the ecosystem, that they are able to just use what is standard, available, and familiar.

Austin
And we’ve also seen, because of OTel, like there’s so many, you know, we’re at KubeCon right now and you go down to the expo floor and start throwing a rock, and you’re going to hit an observability tool that supports OTel, right? Like, it’s something that we definitely talked about when we started the project was, we will know this is successful if people are able to kind of innovate and build businesses and build applications on top of this framework, right? And I think we’ve really seen that happen. Like that’s been proven out. There’s a lot of companies that have kind of, you know, little startups that have gone from, you know, five people to much larger than that now, that are building on top of the framework that OTel provides, right? And doing the analysis stuff. We’ve seen a lot of advances in like, telemetry databases, right? For people that are building their own observability stack. And they’ve been able to kind of leverage OTel to give people this nice, smooth path to, you know, get in and optimize for it. It’s like Kubernetes in a lot of ways, right? Like, Kubernetes freed people from having to kind of go solve all of these lower level problems so that they could start building stuff on top of it and get value more quickly. And I think OTel does much the same thing.

Julián
And from your experience, with AI there are many things you can do. One is using assistants to set up OTel for me, like automatically asking Claude or Codex to do this for me. The other is using OTel to measure and capture all of that traceability and those observability metrics from AI. But what's after that? Is there anything the ecosystem is doing with all the data?

Austin
Yeah.

Julián
And AI like for OpEx, sort of things like that.

Austin
Yeah. So, I wrote a blog post earlier this year that got some attention and it was called “It’s the End of Observability As We Know It And I Feel Fine.” But I posited something in there. So, a little backstory. Since about February of this year, I’ve been working on our AI team at Honeycomb, helping build out our AI platform and one of the things that I pretty quickly realized is that a lot of the current value that an observability tool provides to users is taking this stream of data and interpreting it for you. So, you know, you download whatever and you click a button and it’s like, oh, here’s your pre-configured dashboards, and here’s your pre-configured alerts. And a human being had to sit down and say, hmm, what’s the interesting things that someone needs to know if they have this library, right?

And I believe that GenAI really makes all that irrelevant because why do I need a person sitting down and making a dashboard for me when the AI can just look at the data that’s coming out of my system and create the dashboard in real-time, right? Why do I need these hyper-specialized observability tools that focus solely on observability data when a lot of the value in observability is connecting that sort of performance telemetry with other things like user events, real user monitoring sessions, you know, information in my business intelligence catalog, right? So I see a future where because OTel is focused so much on schemas, so much on the discoverability, has this very defined structure, has a way for applications to kind of publish telemetry schemas that describe what the telemetry that they’re emitting means.

I see kind of a very agentic future for observability, where you have these observability agents that are part of some greater whole, and they’re able to kind of discover these summary schemas about your application and say, okay, well, now I know what everything does and what it means, and I’m just going to sit here and when someone says, hey, what’s going on with my service in prod, it just goes and builds a dashboard for me, right? It builds a view for me in real time. And there’s a lot of like, super interesting questions that come out of that about how you give that data to the agent and so on and so forth. I had a talk about this earlier this week about, you know, what’s the best way to show an LLM a trace?

Julián
Yeah.

Austin
And it’s actually fascinating because it turns out the best way is actually to give it ASCII art.
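As a toy illustration of that idea, here is a short plain-Python sketch that renders a trace tree with ASCII connector lines. The data, layout, and function names are all made up for illustration; this is not how any particular tool formats traces:

```python
# Toy sketch: render a trace tree as ASCII so an LLM (or a human) can read
# the call structure at a glance. Data and layout are purely illustrative.
def render(span, depth=0):
    indent = "  " * depth
    prefix = "|- " if depth else ""  # connector line for child spans
    lines = [f"{indent}{prefix}{span['name']} [{span['duration_ms']}ms]"]
    for child in span.get("children", []):
        lines.extend(render(child, depth + 1))
    return lines

trace = {
    "name": "GET /checkout", "duration_ms": 182, "children": [
        {"name": "auth.verify", "duration_ms": 12},
        {"name": "db.query", "duration_ms": 140, "children": [
            {"name": "pg.execute", "duration_ms": 133},
        ]},
    ],
}

print("\n".join(render(trace)))
```

The indentation and connector characters make the parent-child structure visible in the token stream itself, which is the property that reportedly helps the model.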

Julián
ASCII. Okay, yeah.

Austin
Yep, you draw the little… the trace out with little ASCII connector lines and it can figure it out. And things like that that are just maybe, not super obvious and also things that are changing every six months, right? Like, you know, I have the answer that works really well right now. In six months, we might see another leap in model performance and we have to change all this again. You know, who knows. We’re so early with a lot of this AI stuff. And there’s a certain type of person that’s out there saying AI is going to lead to the singularity or whatever, and we’re all going to be out of a job, and AI will do everything. And then there’s people on the other extreme that say, oh, that AI is a fad. It’s a hoax. I personally, I am here as a member, a founding member of the radical AI centrists.

A lot of people right now overestimate what AI is capable of today, and simultaneously underestimate where AI is going to be in ten years. And there will come a point where advances will plateau, right? I think personally, right now, the next big thing is not necessarily that the models get better, or remarkably better, but that we get better at integrating the models. We get better at software harnesses, right? We get better at making agents. We get better at using the right type of model for the right task, and not just throwing everything at these very high-powered general models. And that's going to be a process, right? None of this stuff happens overnight, but it is happening. And I think GenAI is going to dramatically reshape how we work, how we build software, how we operate software. And we need to be exploring it and thinking about it, and not either writing it off as a fad or getting caught up in these flights of fancy about what it can do. Because at the end of the day, you know, your job is keeping the software up, not the AI's, right?

Julián
Exactly. There is an ocean of opportunities out there. And especially what you told me is starting to give me ideas, like, okay, these are the things I can start doing and researching. So, awesome, and I hope a lot of the people who listen to this episode get inspired.

Austin
Hopefully. I would like for people to be inspired rather than terrified, you know?

Julián
Exactly.

Austin
I think it's such a super cool space. Like, I can honestly say, you know, I've been using computers since I was four and programming since I was a child, right? And this is the most exciting thing since, like, the World Wide Web to me.

Julián
I feel the same. I've been passionate about software development for a while. And at one point I kind of lost that spark.

Austin
Right. It’s got… it got boring. The cloud, the cloud made everything kind of boring. But it’s back! We’re back baby!

Austin
We’ve made the universal function approximator and it do be approximating functions universally.

Julián
Beautiful. That is amazing. Austin, thank you so much for joining us, sharing your experience, sharing your passion for OpenTelemetry, and now this new perspective on how we can rely on AI to build the future.

Austin
Well, thank you so much for having me. This was a ton of fun. If people would like to keep talking with me about this, you can find me on Bluesky at @aparker.io, or look me up at a KubeCon near you.

Julián
Awesome. Thank you so much, and looking forward to having another awesome episode of Code[ish] and seeing you on the next one. Bye bye.

About Code[ish]

A podcast brought to you by the developer advocate team at Heroku, exploring code, technology, tools, tips, and the life of the developer.

Subscribe to Code[ish]


Hosted By:
Julián Duque
Principal Developer Advocate, Heroku
@julian_duque
with Guest:
Austin Parker
Director of Open Source, Honeycomb